The DataFrame class represents a lazily evaluated computation on data. Operations build up a logical query plan that is only executed when an action is called.
Direct construction using DataFrame() is not allowed. Create DataFrames through Session.create_dataframe() or other Session methods.
Properties
schema
Get the schema of this DataFrame.
Schema containing field names and data types
df.schema
# Schema([
# ColumnField('name', StringType),
# ColumnField('age', IntegerType)
# ])
columns
Get list of column names.
List of all column names in the DataFrame
df.columns
# ['name', 'age', 'city']
semantic
df.semantic -> SemanticExtensions
Interface for semantic operations (LLM-powered transformations) on the DataFrame.
Semantic operations interface
# Semantic extraction
df.semantic.extract(
    "content",
    schema=MyPydanticModel,
    model="gpt-4o-mini"
)
write
df.write -> DataFrameWriter
Interface for saving the content of the DataFrame.
Writer interface to write DataFrame to various formats
# Write to Parquet
df.write.parquet("output.parquet")
# Write to CSV
df.write.csv("output.csv")
Selection and Filtering
select
df.select(*cols: ColumnOrName) -> DataFrame
Projects a set of Column expressions or column names.
Column expressions to select. Can be string column names or Column objects.
A new DataFrame with selected columns
# Select by column names
df.select("name", "age")
# Select with expressions
df.select(col("name"), col("age") + 1)
# Mix strings and expressions
df.select("name", col("age") * 2)
filter / where
df.filter(condition: Column) -> DataFrame
df.where(condition: Column) -> DataFrame
Filters rows using the given condition.
A Column expression that evaluates to a boolean
# Numeric comparison
df.filter(col("age") > 25)
# Multiple conditions
df.filter((col("age") > 25) & (col("age") <= 35))
# String matching
df.filter(col("name").contains("Alice"))
drop
df.drop(*col_names: str) -> DataFrame
Remove one or more columns from this DataFrame.
New DataFrame without specified columns
# Drop single column
df.drop("age")
# Drop multiple columns
df.drop("id", "age")
Adding and Modifying Columns
with_column
df.with_column(
    col_name: str,
    col: Union[Any, Column, pl.Series, pd.Series]
) -> DataFrame
Add a new column or replace an existing column.
col
Union[Any, Column, pl.Series, pd.Series]
required
Column expression, Series, or value to assign:
Column: A Column expression
pl.Series or pd.Series: A Series (length must match DataFrame height)
Any other value: Treated as a literal (broadcast to all rows)
New DataFrame with added/replaced column
# Literal value (broadcast to all rows)
df.with_column("constant", 1)
# Computed column expression
df.with_column("double_age", col("age") * 2)
# From a Series (length must match DataFrame height)
df.with_column("scores", pl.Series([85, 92, 78]))
# Complex expression
df.with_column("is_adult", col("age") >= 18)
with_columns
df.with_columns(
    cols_map: Dict[str, Union[Any, Column, pl.Series, pd.Series]]
) -> DataFrame
Add multiple new columns or replace existing columns.
cols_map
Dict[str, Union[Any, Column, pl.Series, pd.Series]]
required
Dictionary where keys are column names and values are Column expressions, Series, or literal values
New DataFrame with added/replaced columns
df.with_columns({
    "double_age": col("age") * 2,
    "constant": 1,
    "age_plus_10": col("age") + 10
})
All columns are created at once, so new columns cannot depend on each other. Use chained with_column() calls for dependent columns.
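The batch semantics above can be made concrete with a plain-Python sketch (an illustration of the behavior, not the library's implementation): every expression is evaluated against the unmodified input row, which is why new columns cannot reference each other.

```python
# Sketch of with_columns() semantics: all expressions see only the original row.
def with_columns(rows, cols_map):
    # Evaluate every expression against the unmodified input row.
    return [{**row, **{name: fn(row) for name, fn in cols_map.items()}}
            for row in rows]

rows = [{"age": 30}]
out = with_columns(rows, {
    "double_age": lambda r: r["age"] * 2,
    "age_plus_10": lambda r: r["age"] + 10,
    # "quad": lambda r: r["double_age"] * 2  # would raise KeyError: not in the input row
})
```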
with_column_renamed
df.with_column_renamed(col_name: str, new_col_name: str) -> DataFrame
Rename a column. No-op if the column does not exist.
Name of the column to rename
New DataFrame with the column renamed
df.with_column_renamed("age", "age_in_years")
Aggregations and Grouping
group_by
df.group_by(*cols: ColumnOrName) -> GroupedData
Groups the DataFrame using the specified columns.
Columns to group by. Can be column names as strings or Column expressions.
Object for performing aggregations on the grouped data
# Group by single column
df.group_by("department").agg(count("*"))
# Group by multiple columns
df.group_by("department", "location").agg({"salary": "avg"})
# Group by expression
df.group_by(lower(col("department"))).agg(count("*"))
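The semantics of grouped aggregation can be sketched in plain Python (illustration only, not the library's implementation; the column names are made up):

```python
from collections import defaultdict

# Sketch of df.group_by("department").agg({"salary": "avg"}) semantics.
rows = [
    {"department": "eng", "salary": 100},
    {"department": "eng", "salary": 120},
    {"department": "ops", "salary": 90},
]

groups = defaultdict(list)
for row in rows:
    groups[row["department"]].append(row["salary"])

avg_salary = {dept: sum(vals) / len(vals) for dept, vals in groups.items()}
```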
agg
df.agg(*exprs: Union[Column, Dict[str, str]]) -> DataFrame
Aggregate on the entire DataFrame without groups.
exprs
Union[Column, Dict[str, str]]
required
Aggregation expressions or dictionary of aggregations
# Multiple aggregations
df.agg(
    count().alias("total_rows"),
    avg(col("salary")).alias("avg_salary")
)
# Dictionary style
df.agg({"salary": "avg", "age": "max"})
Joins and Unions
join
df.join(
    other: DataFrame,
    on: Optional[Union[str, List[str]]] = None,
    *,
    left_on: Optional[Union[ColumnOrName, List[ColumnOrName]]] = None,
    right_on: Optional[Union[ColumnOrName, List[ColumnOrName]]] = None,
    how: JoinType = "inner"
) -> DataFrame
Joins this DataFrame with another DataFrame.
Join condition(s). Can be column name(s), Column expression(s), or None for cross joins
left_on
Union[ColumnOrName, List[ColumnOrName]]
Column(s) from left DataFrame to join on
right_on
Union[ColumnOrName, List[ColumnOrName]]
Column(s) from right DataFrame to join on
Type of join: "inner", "left", "right", "outer", "cross"
# Join on column name (illustrative column names)
df1.join(df2, on="id")
# Join with expressions
df1.join(df2, left_on=col("customer_id"), right_on=col("id"))
# Cross join
df1.join(df2, how="cross")
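For intuition, inner-join semantics can be sketched as a hash join in plain Python (illustration only; the column names "id", "name", and "salary" are made up for the example):

```python
# Sketch of inner-join semantics: index the right side by the join key,
# then probe the index with each left row.
def inner_join(left, right, on):
    index = {}
    for row in right:
        index.setdefault(row[on], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[on], [])]

employees = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bo"}]
salaries = [{"id": 1, "salary": 100}]
out = inner_join(employees, salaries, "id")
```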
union
df.union(other: DataFrame) -> DataFrame
Return a new DataFrame containing the union of rows in this and another DataFrame.
Another DataFrame with the same schema
A new DataFrame containing rows from both DataFrames
df1.union(df2)
# Remove duplicates after union
df1.union(df2).drop_duplicates()
This is equivalent to UNION ALL in SQL. To remove duplicates, use drop_duplicates() after union.
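The UNION ALL behavior can be illustrated with plain Python lists of row tuples (illustration only):

```python
# union() keeps duplicates, like SQL UNION ALL.
df1_rows = [("a", 1), ("b", 2)]
df2_rows = [("b", 2), ("c", 3)]

union_all = df1_rows + df2_rows           # duplicates kept: 4 rows
deduped = list(dict.fromkeys(union_all))  # after drop_duplicates(): 3 rows, order preserved
```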
Sorting and Limiting
sort / order_by
df.sort(
    cols: Union[ColumnOrName, List[ColumnOrName], None] = None,
    ascending: Optional[Union[bool, List[bool]]] = None
) -> DataFrame
df.order_by(
    cols: Union[ColumnOrName, List[ColumnOrName], None] = None,
    ascending: Optional[Union[bool, List[bool]]] = None
) -> DataFrame
Sort the DataFrame by the specified columns.
cols
Union[ColumnOrName, List[ColumnOrName]]
Columns to sort by. Can include sorting directives like asc(), desc()
Boolean or list of booleans indicating sort order
# Sort ascending
df.sort(col("age").asc())
# Sort descending
df.sort(col("age").desc())
# Boolean parameter
df.sort("age", ascending=False)
# Multiple columns
df.sort(col("age").desc(), col("name").asc())
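Multi-column sorts with mixed directions can be expressed in plain Python as stable sorts applied from the last key to the first; this mirrors the last example above (illustration only):

```python
# Mirror of sort(col("age").desc(), col("name").asc()) using stable sorts.
rows = [
    {"age": 30, "name": "bob"},
    {"age": 40, "name": "carol"},
    {"age": 30, "name": "alice"},
]
rows.sort(key=lambda r: r["name"])               # secondary key: name ascending
rows.sort(key=lambda r: r["age"], reverse=True)  # primary key: age descending
```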
limit
df.limit(n: int) -> DataFrame
Limits the number of rows to the specified number.
Maximum number of rows to return
DataFrame with at most n rows
Array and Struct Operations
explode
df.explode(column: ColumnOrName) -> DataFrame
Create a new row for each element in an array column.
Name of array column to explode or Column expression
New DataFrame with the array column exploded into multiple rows
df = session.create_dataframe({
    "id": [1, 2],
    "tags": [["red", "blue"], ["green"]]
})
df.explode("tags")
# id | tags
# ---|------
# 1 | red
# 1 | blue
# 2 | green
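The explode semantics can be sketched in plain Python, one output row per array element with the remaining columns repeated (illustration only, not the library's implementation):

```python
# Sketch of explode(): fan each array element out into its own row.
def explode(rows, column):
    return [{**row, column: item} for row in rows for item in row[column]]

rows = [{"id": 1, "tags": ["red", "blue"]}, {"id": 2, "tags": ["green"]}]
out = explode(rows, "tags")
```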
explode_with_index
df.explode_with_index(
    column: ColumnOrName,
    index_col_name: str = "pos",
    value_col_name: str = "col",
    keep_null_and_empty: bool = False
) -> DataFrame
Create a new row for each element in an array column, with the element’s position and value.
Name of array column to explode
Name for the position column
Name for the value column
If True, preserves rows where the array is null or empty
New DataFrame with position and value columns
df.explode_with_index("tags")
# pos | id | col
# ----|----|------
# 0   | 1  | red
# 1   | 1  | blue
# 0   | 2  | green
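A plain-Python sketch of these semantics, including the keep_null_and_empty flag (illustration only, not the library's implementation):

```python
# Sketch of explode_with_index(): emit (position, value) pairs per element.
def explode_with_index(rows, column, index_col="pos", value_col="col",
                       keep_null_and_empty=False):
    out = []
    for row in rows:
        rest = {k: v for k, v in row.items() if k != column}
        values = row.get(column)
        if not values:  # null or empty array
            if keep_null_and_empty:
                out.append({index_col: None, **rest, value_col: None})
            continue
        for i, v in enumerate(values):
            out.append({index_col: i, **rest, value_col: v})
    return out

rows = [{"id": 1, "tags": ["red", "blue"]}, {"id": 2, "tags": []}]
out = explode_with_index(rows, "tags")
```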
unnest
df.unnest(*col_names: str) -> DataFrame
Unnest the specified struct columns into separate columns.
One or more struct columns to unnest
A new DataFrame with struct columns expanded
df = session.create_dataframe({
    "id": [1, 2],
    "tags": [{"red": 1, "blue": 2}, {"red": 3}]
})
df.unnest("tags")
# id | red | blue
# ---|-----|-----
# 1 | 1 | 2
# 2 | 3 | null
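The unnest semantics, including the null fill for missing struct fields, can be sketched in plain Python (illustration only, not the library's implementation):

```python
# Sketch of unnest(): expand struct (dict) columns into top-level columns,
# filling fields that are absent in a given row with None.
def unnest(rows, *col_names):
    # Collect the full set of struct fields per column across all rows.
    fields = {name: set() for name in col_names}
    for row in rows:
        for name in col_names:
            fields[name].update((row.get(name) or {}).keys())
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in col_names}
        for name in col_names:
            struct = row.get(name) or {}
            for field in sorted(fields[name]):
                new_row[field] = struct.get(field)
        out.append(new_row)
    return out

rows = [{"id": 1, "tags": {"red": 1, "blue": 2}}, {"id": 2, "tags": {"red": 3}}]
out = unnest(rows, "tags")
```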
Deduplication
distinct
df.distinct() -> DataFrame
Return a DataFrame with duplicate rows removed.
A new DataFrame with duplicate rows removed
drop_duplicates
df.drop_duplicates(subset: Optional[List[str]] = None) -> DataFrame
Return a DataFrame with duplicate rows removed.
Column names to consider when identifying duplicates. If not provided, all columns are considered.
A new DataFrame with duplicate rows removed
# Remove duplicates considering all columns
df.drop_duplicates()
# Remove duplicates considering specific columns
df.drop_duplicates(["c1", "c2"])
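The subset behavior can be sketched in plain Python, keeping the first row seen for each distinct key (illustration only, not the library's implementation):

```python
# Sketch of drop_duplicates(): dedupe by the subset columns, or by the
# whole row when no subset is given; first occurrence wins.
def drop_duplicates(rows, subset=None):
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[c] for c in subset) if subset else tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [{"c1": 1, "c2": "a"}, {"c1": 1, "c2": "a"}, {"c1": 1, "c2": "b"}]
all_cols = drop_duplicates(rows)       # exact duplicate dropped
by_c1 = drop_duplicates(rows, ["c1"])  # only c1 compared
```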
Caching
persist / cache
df.persist() -> DataFrame
df.cache() -> DataFrame
Mark this DataFrame to be persisted after first computation.
Same DataFrame, but marked for persistence
# Cache intermediate results
filtered_df = (
    df.filter(col("age") > 25)
    .persist()  # Cache these results
)
# Both operations use cached results
result1 = filtered_df.group_by( "department" ).count()
result2 = filtered_df.select( "name" , "salary" )
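The caching behavior can be sketched in plain Python: the plan is computed on the first action and the materialized result is reused thereafter (illustration only, not the library's implementation):

```python
# Sketch of persist()/cache() semantics: compute once, reuse thereafter.
class CachedPlan:
    def __init__(self, compute):
        self._compute = compute
        self._cache = None
        self.runs = 0  # how many times the plan was actually executed

    def collect(self):
        if self._cache is None:
            self.runs += 1
            self._cache = self._compute()
        return self._cache

plan = CachedPlan(lambda: [age for age in [20, 30, 40] if age > 25])
first = plan.collect()   # executes the plan
second = plan.collect()  # served from the cache; runs stays at 1
```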
Actions (Trigger Execution)
show
df.show(n: int = 10, explain_analyze: bool = False) -> None
Display the DataFrame content in a tabular form.
Number of rows to display
Whether to print the explain analyze plan
df.show()
df.show(20)
df.show(10, explain_analyze=True)
collect
df.collect(data_type: DataLikeType = "polars") -> QueryResult
Execute the DataFrame computation and return the result as a QueryResult.
data_type
DataLikeType
default: "polars"
Type of data to return: "polars", "pandas", "arrow", "pydict", "pylist"
A QueryResult with materialized data and query metrics
result = df.collect()
data = result.data
metrics = result.metrics
to_polars / to_pandas / to_arrow
df.to_polars() -> pl.DataFrame
df.to_pandas() -> pd.DataFrame
df.to_arrow() -> pa.Table
Execute the DataFrame computation and return results in specific format.
result
Union[pl.DataFrame, pd.DataFrame, pa.Table]
Materialized results in the requested format
pl_df = df.to_polars()
pd_df = df.to_pandas()
arrow_table = df.to_arrow()
to_pydict / to_pylist
df.to_pydict() -> Dict[str, List[Any]]
df.to_pylist() -> List[Dict[str, Any]]
Execute the DataFrame computation and return Python data structures.
result
Union[Dict[str, List[Any]], List[Dict[str, Any]]]
Materialized results as Python dictionary or list
# Dictionary of column arrays
data_dict = df.to_pydict()
# {"col1": [1, 2, 3], "col2": ["a", "b", "c"]}
# List of row dictionaries
data_list = df.to_pylist()
# [{"col1": 1, "col2": "a"}, {"col1": 2, "col2": "b"}]
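The two orientations are interconvertible; a plain-Python illustration of the relationship between them:

```python
# Column-oriented dict (to_pydict shape) vs. list of row dicts (to_pylist shape).
pydict = {"col1": [1, 2, 3], "col2": ["a", "b", "c"]}

# Column-oriented dict -> list of row dicts
pylist = [dict(zip(pydict, values)) for values in zip(*pydict.values())]

# List of row dicts -> column-oriented dict (round trip)
roundtrip = {name: [row[name] for row in pylist] for name in pydict}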
count
df.count() -> int
Count the number of rows in the DataFrame.
The number of rows in the DataFrame
Lineage
lineage
df.lineage() -> Lineage
Create a Lineage object to trace data through transformations.
Interface for querying data lineage
lineage = df.lineage()
# Trace backwards
source_rows = lineage.backward(["result_uuid1", "result_uuid2"])
# Trace forwards
result_rows = lineage.forward(["source_uuid1"])
Utility Methods
explain
df.explain() -> None
Display the logical plan of the DataFrame.
Aliases
withColumn → with_column
withColumns → with_columns
withColumnRenamed → with_column_renamed
groupBy / groupby → group_by
dropDuplicates → drop_duplicates
orderBy → order_by
unionAll → union