The DataFrame class represents a lazily evaluated computation on data. Operations build up a logical query plan that is only executed when an action is called.
Direct construction using DataFrame() is not allowed. Create DataFrames through Session.create_dataframe() or other Session methods.
Properties
schema
Get the schema of this DataFrame.
Schema containing field names and data types
df.schema
# Schema([
# ColumnField('name', StringType),
# ColumnField('age', IntegerType)
# ])
columns
Get list of column names.
List of all column names in the DataFrame
df.columns
# ['name', 'age', 'city']
semantic
df.semantic -> SemanticExtensions
Interface for semantic operations (LLM-powered transformations) on the DataFrame.
Semantic operations interface
# Semantic extraction
df.semantic.extract(
    "content",
    schema=MyPydanticModel,
    model="gpt-4o-mini"
)
write
df.write -> DataFrameWriter
Interface for saving the content of the DataFrame.
Writer interface to write DataFrame to various formats
# Write to Parquet
df.write.parquet("output.parquet")
# Write to CSV
df.write.csv("output.csv")
Selection and Filtering
select
df.select(*cols: ColumnOrName) -> DataFrame
Projects a set of Column expressions or column names.
Column expressions to select. Can be string column names or Column objects.
A new DataFrame with selected columns
# Select by column names
df.select("name", "age")
# Select with expressions
df.select(col("name"), col("age") + 1)
# Mix strings and expressions
df.select("name", col("age") * 2)
filter / where
df.filter(condition: Column) -> DataFrame
df.where(condition: Column) -> DataFrame
Filters rows using the given condition.
A Column expression that evaluates to a boolean
# Numeric comparison
df.filter(col("age") > 25)
# Multiple conditions
df.filter((col("age") > 25) & (col("age") <= 35))
# String matching
df.filter(col("name").contains("Alice"))
drop
df.drop(*col_names: str) -> DataFrame
Remove one or more columns from this DataFrame.
New DataFrame without specified columns
# Drop single column
df.drop("age")
# Drop multiple columns
df.drop("id", "age")
Adding and Modifying Columns
with_column
df.with_column(
    col_name: str,
    col: Union[Any, Column, pl.Series, pd.Series]
) -> DataFrame
Add a new column or replace an existing column.
col
Union[Any, Column, pl.Series, pd.Series]
required
Column expression, Series, or value to assign:
Column: A Column expression
pl.Series or pd.Series: A Series (length must match DataFrame height)
Any other value: Treated as a literal (broadcast to all rows)
New DataFrame with added/replaced column
# Literal value (broadcast to all rows)
df.with_column("constant", 1)
# Computed column expression
df.with_column("double_age", col("age") * 2)
# From a Series (length must match DataFrame height)
df.with_column("scores", pl.Series([85, 92, 78]))
# Complex expression
df.with_column("is_adult", col("age") >= 18)
with_columns
df.with_columns(
    cols_map: Dict[str, Union[Any, Column, pl.Series, pd.Series]]
) -> DataFrame
Add multiple new columns or replace existing columns.
cols_map
Dict[str, Union[Any, Column, pl.Series, pd.Series]]
required
Dictionary where keys are column names and values are Column expressions, Series, or literal values
New DataFrame with added/replaced columns
df.with_columns({
    "double_age": col("age") * 2,
    "constant": 1,
    "age_plus_10": col("age") + 10
})
All columns are created at once, so new columns cannot depend on each other. Use chained with_column() calls for dependent columns.
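The batch semantics above can be made concrete with a plain-Python sketch (an illustration of the behavior, not the library's implementation): every expression is evaluated against the unmodified input row, which is why new columns cannot reference each other.

```python
# Sketch of with_columns() semantics: all expressions see only the original row.
def with_columns(rows, cols_map):
    # Evaluate every expression against the unmodified input row.
    return [{**row, **{name: fn(row) for name, fn in cols_map.items()}}
            for row in rows]

rows = [{"age": 30}]
out = with_columns(rows, {
    "double_age": lambda r: r["age"] * 2,
    "age_plus_10": lambda r: r["age"] + 10,
    # "quad": lambda r: r["double_age"] * 2  # would raise KeyError: not in the input row
})
```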
with_column_renamed
df.with_column_renamed(col_name: str, new_col_name: str) -> DataFrame
Rename a column. No-op if the column does not exist.
Name of the column to rename
New DataFrame with the column renamed
df.with_column_renamed("age", "age_in_years")
Aggregations and Grouping
group_by
df.group_by(*cols: ColumnOrName) -> GroupedData
Groups the DataFrame using the specified columns.
Columns to group by. Can be column names as strings or Column expressions.
Object for performing aggregations on the grouped data
# Group by single column
df.group_by("department").agg(count("*"))
# Group by multiple columns
df.group_by("department", "location").agg({"salary": "avg"})
# Group by expression
df.group_by(lower(col("department"))).agg(count("*"))
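The semantics of grouped aggregation can be sketched in plain Python (illustration only, not the library's implementation; the column names are made up):

```python
from collections import defaultdict

# Sketch of df.group_by("department").agg({"salary": "avg"}) semantics.
rows = [
    {"department": "eng", "salary": 100},
    {"department": "eng", "salary": 120},
    {"department": "ops", "salary": 90},
]

groups = defaultdict(list)
for row in rows:
    groups[row["department"]].append(row["salary"])

avg_salary = {dept: sum(vals) / len(vals) for dept, vals in groups.items()}
```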
agg
df.agg(*exprs: Union[Column, Dict[str, str]]) -> DataFrame
Aggregate on the entire DataFrame without groups.
exprs
Union[Column, Dict[str, str]]
required
Aggregation expressions or dictionary of aggregations
# Multiple aggregations
df.agg(
    count().alias("total_rows"),
    avg(col("salary")).alias("avg_salary")
)
# Dictionary style
df.agg({"salary": "avg", "age": "max"})
Joins and Unions
join
df.join(
    other: DataFrame,
    on: Optional[Union[str, List[str]]] = None,
    *,
    left_on: Optional[Union[ColumnOrName, List[ColumnOrName]]] = None,
    right_on: Optional[Union[ColumnOrName, List[ColumnOrName]]] = None,
    how: JoinType = "inner"
) -> DataFrame
Joins this DataFrame with another DataFrame.
Join condition(s). Can be column name(s), Column expression(s), or None for cross joins
left_on
Union[ColumnOrName, List[ColumnOrName]]
Column(s) from left DataFrame to join on
right_on
Union[ColumnOrName, List[ColumnOrName]]
Column(s) from right DataFrame to join on
Type of join: "inner", "left", "right", "outer", "cross"
# Join on column name (illustrative column names)
df1.join(df2, on="id")
# Join with expressions
df1.join(df2, left_on=col("customer_id"), right_on=col("id"))
# Cross join
df1.join(df2, how="cross")
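For intuition, inner-join semantics can be sketched as a hash join in plain Python (illustration only; the column names "id", "name", and "salary" are made up for the example):

```python
# Sketch of inner-join semantics: index the right side by the join key,
# then probe the index with each left row.
def inner_join(left, right, on):
    index = {}
    for row in right:
        index.setdefault(row[on], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[on], [])]

employees = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bo"}]
salaries = [{"id": 1, "salary": 100}]
out = inner_join(employees, salaries, "id")
```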
union
df.union(other: DataFrame) -> DataFrame
Return a new DataFrame containing the union of rows in this and another DataFrame.
Another DataFrame with the same schema
A new DataFrame containing rows from both DataFrames
df1.union(df2)
# Remove duplicates after union
df1.union(df2).drop_duplicates()
This is equivalent to UNION ALL in SQL. To remove duplicates, use drop_duplicates() after union.
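The UNION ALL behavior can be illustrated with plain Python lists of row tuples (illustration only):

```python
# union() keeps duplicates, like SQL UNION ALL.
df1_rows = [("a", 1), ("b", 2)]
df2_rows = [("b", 2), ("c", 3)]

union_all = df1_rows + df2_rows           # duplicates kept: 4 rows
deduped = list(dict.fromkeys(union_all))  # after drop_duplicates(): 3 rows, order preserved
```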
Sorting and Limiting
sort / order_by
df.sort(
    cols: Union[ColumnOrName, List[ColumnOrName], None] = None,
    ascending: Optional[Union[bool, List[bool]]] = None
) -> DataFrame
df.order_by(
    cols: Union[ColumnOrName, List[ColumnOrName], None] = None,
    ascending: Optional[Union[bool, List[bool]]] = None
) -> DataFrame
Sort the DataFrame by the specified columns.
cols
Union[ColumnOrName, List[ColumnOrName]]
Columns to sort by. Can include sorting directives like asc(), desc()
Boolean or list of booleans indicating sort order
# Sort ascending
df.sort(col("age").asc())
# Sort descending
df.sort(col("age").desc())
# Boolean parameter
df.sort("age", ascending=False)
# Multiple columns
df.sort(col("age").desc(), col("name").asc())
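Multi-column sorts with mixed directions can be expressed in plain Python as stable sorts applied from the last key to the first; this mirrors the last example above (illustration only):

```python
# Mirror of sort(col("age").desc(), col("name").asc()) using stable sorts.
rows = [
    {"age": 30, "name": "bob"},
    {"age": 40, "name": "carol"},
    {"age": 30, "name": "alice"},
]
rows.sort(key=lambda r: r["name"])               # secondary key: name ascending
rows.sort(key=lambda r: r["age"], reverse=True)  # primary key: age descending
```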
limit
df.limit(n: int) -> DataFrame
Limits the number of rows to the specified number.
Maximum number of rows to return
DataFrame with at most n rows
Array and Struct Operations
explode
df.explode(column: ColumnOrName) -> DataFrame
Create a new row for each element in an array column.
Name of array column to explode or Column expression
New DataFrame with the array column exploded into multiple rows
df = session.create_dataframe({
    "id": [1, 2],
    "tags": [["red", "blue"], ["green"]]
})
df.explode("tags")
# id | tags
# ---|------
# 1 | red
# 1 | blue
# 2 | green
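The explode semantics can be sketched in plain Python, one output row per array element with the remaining columns repeated (illustration only, not the library's implementation):

```python
# Sketch of explode(): fan each array element out into its own row.
def explode(rows, column):
    return [{**row, column: item} for row in rows for item in row[column]]

rows = [{"id": 1, "tags": ["red", "blue"]}, {"id": 2, "tags": ["green"]}]
out = explode(rows, "tags")
```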
explode_with_index
df.explode_with_index(
    column: ColumnOrName,
    index_col_name: str = "pos",
    value_col_name: str = "col",
    keep_null_and_empty: bool = False
) -> DataFrame
Create a new row for each element in an array column, with the element’s position and value.
Name of array column to explode
Name for the position column
Name for the value column
If True, preserves rows where the array is null or empty
New DataFrame with position and value columns
df.explode_with_index("tags")
# pos | id | col
# ----|----|------
# 0   | 1  | red
# 1   | 1  | blue
# 0   | 2  | green
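A plain-Python sketch of these semantics, including the keep_null_and_empty flag (illustration only, not the library's implementation):

```python
# Sketch of explode_with_index(): emit (position, value) pairs per element.
def explode_with_index(rows, column, index_col="pos", value_col="col",
                       keep_null_and_empty=False):
    out = []
    for row in rows:
        rest = {k: v for k, v in row.items() if k != column}
        values = row.get(column)
        if not values:  # null or empty array
            if keep_null_and_empty:
                out.append({index_col: None, **rest, value_col: None})
            continue
        for i, v in enumerate(values):
            out.append({index_col: i, **rest, value_col: v})
    return out

rows = [{"id": 1, "tags": ["red", "blue"]}, {"id": 2, "tags": []}]
out = explode_with_index(rows, "tags")
```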
unnest
df.unnest(*col_names: str) -> DataFrame
Unnest the specified struct columns into separate columns.
One or more struct columns to unnest
A new DataFrame with struct columns expanded
df = session.create_dataframe({
    "id": [1, 2],
    "tags": [{"red": 1, "blue": 2}, {"red": 3}]
})
df.unnest("tags")
# id | red | blue
# ---|-----|-----
# 1 | 1 | 2
# 2 | 3 | null
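The unnest semantics, including the null fill for missing struct fields, can be sketched in plain Python (illustration only, not the library's implementation):

```python
# Sketch of unnest(): expand struct (dict) columns into top-level columns,
# filling fields that are absent in a given row with None.
def unnest(rows, *col_names):
    # Collect the full set of struct fields per column across all rows.
    fields = {name: set() for name in col_names}
    for row in rows:
        for name in col_names:
            fields[name].update((row.get(name) or {}).keys())
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in col_names}
        for name in col_names:
            struct = row.get(name) or {}
            for field in sorted(fields[name]):
                new_row[field] = struct.get(field)
        out.append(new_row)
    return out

rows = [{"id": 1, "tags": {"red": 1, "blue": 2}}, {"id": 2, "tags": {"red": 3}}]
out = unnest(rows, "tags")
```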
Deduplication
distinct
df.distinct() -> DataFrame
Return a DataFrame with duplicate rows removed.
A new DataFrame with duplicate rows removed
drop_duplicates
df.drop_duplicates(subset: Optional[List[str]] = None) -> DataFrame
Return a DataFrame with duplicate rows removed.
Column names to consider when identifying duplicates. If not provided, all columns are considered.
A new DataFrame with duplicate rows removed
# Remove duplicates considering all columns
df.drop_duplicates()
# Remove duplicates considering specific columns
df.drop_duplicates(["c1", "c2"])
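The subset behavior can be sketched in plain Python, keeping the first row seen for each distinct key (illustration only, not the library's implementation):

```python
# Sketch of drop_duplicates(): dedupe by the subset columns, or by the
# whole row when no subset is given; first occurrence wins.
def drop_duplicates(rows, subset=None):
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[c] for c in subset) if subset else tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [{"c1": 1, "c2": "a"}, {"c1": 1, "c2": "a"}, {"c1": 1, "c2": "b"}]
all_cols = drop_duplicates(rows)       # exact duplicate dropped
by_c1 = drop_duplicates(rows, ["c1"])  # only c1 compared
```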
Caching
persist / cache
df.persist() -> DataFrame
df.cache() -> DataFrame
Mark this DataFrame to be persisted after first computation.
Same DataFrame, but marked for persistence
# Cache intermediate results
filtered_df = (
    df.filter(col("age") > 25)
    .persist()  # Cache these results
)
# Both operations use cached results
result1 = filtered_df.group_by( "department" ).count()
result2 = filtered_df.select( "name" , "salary" )
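The caching behavior can be sketched in plain Python: the plan is computed on the first action and the materialized result is reused thereafter (illustration only, not the library's implementation):

```python
# Sketch of persist()/cache() semantics: compute once, reuse thereafter.
class CachedPlan:
    def __init__(self, compute):
        self._compute = compute
        self._cache = None
        self.runs = 0  # how many times the plan was actually executed

    def collect(self):
        if self._cache is None:
            self.runs += 1
            self._cache = self._compute()
        return self._cache

plan = CachedPlan(lambda: [age for age in [20, 30, 40] if age > 25])
first = plan.collect()   # executes the plan
second = plan.collect()  # served from the cache; runs stays at 1
```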
Actions (Trigger Execution)
show
df.show(n: int = 10, explain_analyze: bool = False) -> None
Display the DataFrame content in a tabular form.
Number of rows to display
Whether to print the explain analyze plan
df.show()
df.show(20)
df.show(10, explain_analyze=True)
collect
df.collect(data_type: DataLikeType = "polars") -> QueryResult
Execute the DataFrame computation and return the result as a QueryResult.
data_type
DataLikeType
default: "polars"
Type of data to return: "polars", "pandas", "arrow", "pydict", "pylist"
A QueryResult with materialized data and query metrics
result = df.collect()
data = result.data
metrics = result.metrics
to_polars / to_pandas / to_arrow
df.to_polars() -> pl.DataFrame
df.to_pandas() -> pd.DataFrame
df.to_arrow() -> pa.Table
Execute the DataFrame computation and return results in specific format.
result
Union[pl.DataFrame, pd.DataFrame, pa.Table]
Materialized results in the requested format
pl_df = df.to_polars()
pd_df = df.to_pandas()
arrow_table = df.to_arrow()
to_pydict / to_pylist
df.to_pydict() -> Dict[str, List[Any]]
df.to_pylist() -> List[Dict[str, Any]]
Execute the DataFrame computation and return Python data structures.
result
Union[Dict[str, List[Any]], List[Dict[str, Any]]]
Materialized results as Python dictionary or list
# Dictionary of column arrays
data_dict = df.to_pydict()
# {"col1": [1, 2, 3], "col2": ["a", "b", "c"]}
# List of row dictionaries
data_list = df.to_pylist()
# [{"col1": 1, "col2": "a"}, {"col1": 2, "col2": "b"}]
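The two orientations are interconvertible; a plain-Python illustration of the relationship between them:

```python
# Column-oriented dict (to_pydict shape) vs. list of row dicts (to_pylist shape).
pydict = {"col1": [1, 2, 3], "col2": ["a", "b", "c"]}

# Column-oriented dict -> list of row dicts
pylist = [dict(zip(pydict, values)) for values in zip(*pydict.values())]

# List of row dicts -> column-oriented dict (round trip)
roundtrip = {name: [row[name] for row in pylist] for name in pydict}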
count
df.count() -> int
Count the number of rows in the DataFrame.
The number of rows in the DataFrame
Lineage
lineage
df.lineage() -> Lineage
Create a Lineage object to trace data through transformations.
Interface for querying data lineage
lineage = df.lineage()
# Trace backwards
source_rows = lineage.backward(["result_uuid1", "result_uuid2"])
# Trace forwards
result_rows = lineage.forward(["source_uuid1"])
Utility Methods
explain
df.explain() -> None
Display the logical plan of the DataFrame.
Aliases
withColumn → with_column
withColumns → with_columns
withColumnRenamed → with_column_renamed
groupBy / groupby → group_by
dropDuplicates → drop_duplicates
orderBy → order_by
unionAll → union