Overview
Fenic provides a rich type system designed for both traditional data processing and AI workloads. Types are used for schema definition, casting, validation, and optimization throughout the DataFrame API.
Type Categories
Fenic types fall into three main categories:
Primitive Types: Basic types like strings, integers, and booleans
Composite Types: Arrays and structs containing other types
Logical Types: Specialized types for AI workloads (embeddings, markdown, JSON, etc.)
Primitive Types
StringType
Represents UTF-8 encoded string values.
from fenic.core.types import StringType
# Used in schema definitions
StructField("name", StringType)
# Cast to string
df.with_column("str_value", col("number").cast(StringType))
IntegerType
Represents signed integer values.
from fenic.core.types import IntegerType
StructField("age", IntegerType)
df.with_column("int_value", col("string_number").cast(IntegerType))
FloatType
Represents 32-bit floating-point numbers.
from fenic.core.types import FloatType
StructField("score", FloatType)
df.with_column("float_value", col("decimal_str").cast(FloatType))
DoubleType
Represents 64-bit floating-point numbers (higher precision than float).
from fenic.core.types import DoubleType
StructField("precise_value", DoubleType)
df.with_column("double_value", col("high_precision").cast(DoubleType))
BooleanType
Represents boolean True/False values.
from fenic.core.types import BooleanType
StructField("is_active", BooleanType)
df.with_column("bool_value", col("flag").cast(BooleanType))
DateType
Represents date values (year, month, day).
from fenic.core.types import DateType
StructField("birth_date", DateType)
df.with_column("date_value", col("date_string").cast(DateType))
TimestampType
Represents timestamp values with date and time.
from fenic.core.types import TimestampType
StructField("created_at", TimestampType)
df.with_column("ts_value", col("timestamp_str").cast(TimestampType))
Composite Types
ArrayType
Represents a homogeneous variable-length array (list) of elements.
Parameters:
element_type: The data type of each element in the array
from fenic.core.types import ArrayType, StringType
# Define schema with array
ArrayType(StringType)
ArrayType(element_type=StringType)
# Example: tags column
StructField("tags", ArrayType(StringType))
Working with Arrays
from fenic.api.functions import col, array
# Create array column
df.with_column("tags", array(["tag1", "tag2", "tag3"]))
# Explode array into rows
df.select("id", col("tags").explode())
# Array length
df.with_column("tag_count", col("tags").list.len())
# Access array element
df.with_column("first_tag", col("tags").list.get(0))
StructType
Represents a struct (record) with named fields.
Parameters:
struct_fields (List[StructField], required): List of field definitions (name and type pairs)
from fenic.core.types import StructType, StructField, StringType, IntegerType
address_type = StructType([
    StructField("street", StringType),
    StructField("city", StringType),
    StructField("zip_code", IntegerType)
])
StructField("address", address_type)
Working with Structs
from fenic.api.functions import col
# Access struct field
df.select(col("address").struct.field("city"))
# Unnest struct into separate columns
df.unnest("address")  # Creates: address_street, address_city, address_zip_code
# Create struct from columns
from fenic.api.functions import struct
df.with_column(
    "location",
    struct([col("latitude"), col("longitude")])
)
Logical Types
Logical types carry extra semantic meaning for AI operations. Most (markdown, HTML, JSON, transcript) are specialized string types; EmbeddingType holds a fixed-length vector.
EmbeddingType
Represents a fixed-length embedding vector.
Parameters:
dimensions: Number of dimensions in the embedding vector
embedding_model: Name of the model used to generate the embedding
from fenic.core.types import EmbeddingType
# OpenAI text-embedding-3-small (1536 dimensions)
EmbeddingType(1536, embedding_model="text-embedding-3-small")
# Cohere embed-v4 (1024 dimensions)
EmbeddingType(1024, embedding_model="embed-v4.0")
Generating Embeddings
from fenic.api.functions import semantic
# Generate embeddings
df = df.with_column(
    "text_embeddings",
    semantic.embed(col("text_column"))
)
# The column will have EmbeddingType(dimensions, model)
# based on your configured embedding model
MarkdownType
Represents a string containing Markdown-formatted text.
from fenic.core.types import MarkdownType
StructField("document", MarkdownType)
Use Cases
Storing formatted documentation
Output from PDF parsing
Rich text content for LLM processing
from fenic.api.functions import semantic
# Parse PDF to markdown
df = df.with_column(
    "markdown_content",
    semantic.parse_pdf(col("pdf_path"))
)
# Result column has MarkdownType
HtmlType
Represents a string containing raw HTML markup.
from fenic.core.types import HtmlType
StructField("webpage_content", HtmlType)
JsonType
Represents a string containing valid JSON data.
from fenic.core.types import JsonType
StructField("api_response", JsonType)
Working with JSON
from fenic.api.functions import col
# Parse JSON string to struct
df.with_column(
    "parsed",
    col("json_column").str.json_extract()
)
# Convert to JSON string
df.with_column(
    "json_str",
    col("struct_column").to_json()
)
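Under the hood, a JsonType column still holds a JSON-encoded string; the cast operations above behave conceptually like Python's standard `json` module. A minimal stdlib illustration (not fenic code):

```python
import json

# A JsonType column stores content like this raw string
raw = '{"status": "ok", "items": [1, 2, 3]}'

# Parsing (analogous to json_extract): string -> structured value
parsed = json.loads(raw)
print(parsed["items"])  # [1, 2, 3]

# Serializing (analogous to to_json): structured value -> string
print(json.dumps({"a": 1}))  # {"a": 1}
```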
TranscriptType
Represents a string containing a transcript in a specific format.
Parameters:
format (Literal['generic', 'srt', 'webvtt'], required): The transcript format
from fenic.core.types import TranscriptType
# Generic transcript format
TranscriptType(format="generic")
StructField("transcript", TranscriptType(format="generic"))
# SubRip (.srt) subtitle format
TranscriptType(format="srt")
StructField("subtitles", TranscriptType(format="srt"))
# WebVTT (.vtt) subtitle format
TranscriptType(format="webvtt")
StructField("captions", TranscriptType(format="webvtt"))
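To make the formats concrete, here is the shape of a single SubRip cue, the kind of content a `TranscriptType(format="srt")` column would hold. The parsing below is a plain-Python illustration of the cue structure, not fenic's transcript parser:

```python
# One SubRip (.srt) cue: index, timing line, then the caption text
srt_cue = """1
00:00:01,000 --> 00:00:03,500
Hello, world!"""

# Split the cue into its three structural parts
index, timing, text = srt_cue.split("\n", 2)
start, end = timing.split(" --> ")
print(index)  # 1
print(start)  # 00:00:01,000
print(end)    # 00:00:03,500
print(text)   # Hello, world!
```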
DocumentPathType
Represents a string containing a document’s local (file system) or remote (URL) path.
from fenic.core.types import DocumentPathType
# PDF path type
DocumentPathType(format="pdf")
StructField("pdf_path", DocumentPathType(format="pdf"))
Type Inspection
Getting DataFrame Schema
# Get schema
schema = df.schema
print(schema)
# Schema([
# ColumnField('name', StringType),
# ColumnField('age', IntegerType),
# ColumnField('tags', ArrayType(StringType))
# ])
# Print formatted schema
df.print_schema()
# root
# |-- name: StringType
# |-- age: IntegerType
# |-- tags: ArrayType(StringType)
# Get column names
columns = df.columns # ['name', 'age', 'tags']
# Get specific field type
name_field = schema.field("name")
print(name_field.data_type)  # StringType
Type Checking
from fenic.core.types import StringType, IntegerType, ArrayType
# Check if types match
StringType == StringType # True
StringType == IntegerType # False
# Check array element type
array_type = ArrayType(StringType)
array_type.element_type == StringType # True
# Check struct fields
from fenic.core.types import StructType, StructField
struct_type = StructType([
    StructField("name", StringType),
    StructField("age", IntegerType)
])
struct_type.struct_fields[0].name  # "name"
struct_type.struct_fields[0].data_type  # StringType
Type Casting
from fenic.api.functions import col
from fenic.core.types import StringType, IntegerType, DoubleType
# Cast to string
df.with_column("age_str", col("age").cast(StringType))
# Cast to integer
df.with_column("age_int", col("age_str").cast(IntegerType))
# Cast to double
df.with_column("score_double", col("score").cast(DoubleType))
Schema Definition
Define explicit schemas for reading data:
from fenic.core.types import (
Schema,
ColumnField,
StringType,
IntegerType,
ArrayType,
StructType,
StructField
)
schema = Schema([
    ColumnField("id", IntegerType),
    ColumnField("name", StringType),
    ColumnField("tags", ArrayType(StringType)),
    ColumnField("metadata", StructType([
        StructField("created_at", StringType),
        StructField("version", IntegerType)
    ]))
])
df = session.read.csv("data.csv", schema=schema)
Type Inference
Fenic automatically infers types when reading data:
# From CSV - infers types from data
df = session.read.csv("data.csv")
# From Parquet - uses Parquet schema
df = session.read.parquet("data.parquet")
# From dictionary - infers from Python types
df = session.create_dataframe({
    "name": ["Alice", "Bob"],    # -> StringType
    "age": [25, 30],             # -> IntegerType
    "score": [95.5, 87.3],       # -> DoubleType
    "active": [True, False]      # -> BooleanType
})
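The value-to-type mapping above can be sketched as a plain-Python function. This is a hypothetical illustration of the mapping, not fenic's actual inference code, which runs inside the engine:

```python
# Hypothetical sketch of inferring a fenic type name from a Python value
def infer_type(value):
    # Check bool before int: in Python, bool is a subclass of int
    if isinstance(value, bool):
        return "BooleanType"
    if isinstance(value, int):
        return "IntegerType"
    if isinstance(value, float):
        return "DoubleType"
    if isinstance(value, str):
        return "StringType"
    raise TypeError(f"unsupported value: {value!r}")

print(infer_type("Alice"))  # StringType
print(infer_type(25))       # IntegerType
print(infer_type(95.5))     # DoubleType
print(infer_type(True))     # BooleanType
```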
Best Practices
Use logical types for semantic meaning
Logical types preserve semantic information for LLM operations:
# Good: preserves type information
StructField("document", MarkdownType)
StructField("embeddings", EmbeddingType(1536, "text-embedding-3-small"))
# Avoid: loses semantic meaning
StructField("document", StringType)
Be explicit about numeric precision
Choose appropriate numeric types based on precision needs:
# For precise calculations
StructField("price", DoubleType)  # 64-bit precision
# For memory efficiency
StructField("count", IntegerType)
StructField("ratio", FloatType)  # 32-bit sufficient
Define schemas for complex data
Explicit schemas prevent inference errors:
# Good: explicit schema
schema = Schema([
    ColumnField("id", IntegerType),
    ColumnField("metadata", StructType([
        StructField("tags", ArrayType(StringType))
    ]))
])
df = session.read.json("data.json", schema=schema)
Use appropriate types for embeddings
Match embedding dimensions to your model:
# text-embedding-3-small: 1536 dimensions
EmbeddingType(1536, "text-embedding-3-small")
# text-embedding-3-large: 3072 dimensions
EmbeddingType(3072, "text-embedding-3-large")
# Cohere embed-v4: 1024 dimensions
EmbeddingType(1024, "embed-v4.0")
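A mismatch between a model's output size and the declared dimensions is easy to guard against. The helper below is a hypothetical sketch (not part of the fenic API) of the kind of check the declared `EmbeddingType` dimensions enable:

```python
# Hypothetical helper: verify a vector matches declared EmbeddingType dims
def check_dimensions(vector, expected_dims):
    if len(vector) != expected_dims:
        raise ValueError(
            f"expected {expected_dims} dimensions, got {len(vector)}"
        )
    return True

# A 1536-element vector matches text-embedding-3-small's declared size
print(check_dimensions([0.0] * 1536, 1536))  # True
```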
Type Compatibility
Numeric Type Hierarchy
IntegerType
↓ (can cast to)
FloatType
↓ (can cast to)
DoubleType
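The 32- vs 64-bit distinction behind this hierarchy can be shown with the standard library alone: round-tripping a value through a 32-bit float (the width of FloatType) loses precision that a 64-bit double (DoubleType) keeps. This is a stdlib illustration, not fenic code:

```python
import struct

value = 0.1
# Round-trip through IEEE-754 32-bit and 64-bit representations
as_float32 = struct.unpack("f", struct.pack("f", value))[0]
as_float64 = struct.unpack("d", struct.pack("d", value))[0]

print(as_float32 == value)  # False: 32-bit rounding error
print(as_float64 == value)  # True: full 64-bit precision preserved
```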
String to Other Types
# String can be cast to most types
StringType → IntegerType # "123" → 123
StringType → DoubleType # "123.45" → 123.45
StringType → BooleanType # "true" → True
StringType → DateType # "2024-01-01" → Date(2024, 1, 1)
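The casts above have plain-Python analogues, shown here for intuition only; fenic performs the actual conversions inside its engine:

```python
from datetime import date

print(int("123"))                        # 123
print(float("123.45"))                   # 123.45
print("true".lower() == "true")          # True
print(date.fromisoformat("2024-01-01"))  # 2024-01-01
```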
Logical Types
Logical types are specialized strings - they can be used anywhere strings are accepted, but preserve additional semantic meaning:
# These are all string-based
MarkdownType # String with markdown formatting
JsonType # String with JSON content
HtmlType # String with HTML content
TranscriptType # String with transcript content
Next Steps
DataFrames: Work with typed data in DataFrames
Semantic Operators: Use logical types with LLM operations