The DataFrame class represents a lazily evaluated computation on data. Operations build up a logical query plan that is only executed when an action is called.
Direct construction using DataFrame() is not allowed. Create DataFrames through Session.create_dataframe() or other Session methods.

Properties

schema

df.schema -> Schema
Get the schema of this DataFrame.
Returns: Schema - Schema containing field names and data types
df.schema
# Schema([
#     ColumnField('name', StringType),
#     ColumnField('age', IntegerType)
# ])

columns

df.columns -> List[str]
Get list of column names.
Returns: List[str] - List of all column names in the DataFrame
df.columns
# ['name', 'age', 'city']

semantic

df.semantic -> SemanticExtensions
Interface for semantic operations (LLM-powered transformations) on the DataFrame.
Returns: SemanticExtensions - Semantic operations interface
# Semantic extraction
df.semantic.extract(
    "content",
    schema=MyPydanticModel,
    model="gpt-4o-mini"
)

write

df.write -> DataFrameWriter
Interface for saving the content of the DataFrame.
Returns: DataFrameWriter - Writer interface to write DataFrame to various formats
# Write to Parquet
df.write.parquet("output.parquet")

# Write to CSV
df.write.csv("output.csv")

Selection and Filtering

select

df.select(*cols: ColumnOrName) -> DataFrame
Projects a set of Column expressions or column names.
Parameters:
  • cols (ColumnOrName, required): Column expressions to select. Can be string column names or Column objects.
Returns: DataFrame - A new DataFrame with selected columns
# Select by column names
df.select("name", "age")

# Select with expressions
df.select(col("name"), col("age") + 1)

# Mix strings and expressions
df.select("name", col("age") * 2)

filter / where

df.filter(condition: Column) -> DataFrame
df.where(condition: Column) -> DataFrame
Filters rows using the given condition.
Parameters:
  • condition (Column, required): A Column expression that evaluates to a boolean
Returns: DataFrame - Filtered DataFrame
# Numeric comparison
df.filter(col("age") > 25)

# Multiple conditions
df.filter((col("age") > 25) & (col("age") <= 35))

# String matching
df.filter(col("name").contains("Alice"))

drop

df.drop(*col_names: str) -> DataFrame
Remove one or more columns from this DataFrame.
Parameters:
  • col_names (str, required): Names of columns to drop
Returns: DataFrame - New DataFrame without specified columns
# Drop single column
df.drop("age")

# Drop multiple columns
df.drop("id", "age")

Adding and Modifying Columns

with_column

df.with_column(
    col_name: str,
    col: Union[Any, Column, pl.Series, pd.Series]
) -> DataFrame
Add a new column or replace an existing column.
Parameters:
  • col_name (str, required): Name of the new column
  • col (Union[Any, Column, pl.Series, pd.Series], required): Column expression, Series, or value to assign:
      • Column: A Column expression
      • pl.Series or pd.Series: A Series (length must match DataFrame height)
      • Any other value: Treated as a literal (broadcast to all rows)
Returns: DataFrame - New DataFrame with added/replaced column
df.with_column("constant", 1)

with_columns

df.with_columns(
    cols_map: Dict[str, Union[Any, Column, pl.Series, pd.Series]]
) -> DataFrame
Add multiple new columns or replace existing columns.
Parameters:
  • cols_map (Dict[str, Union[Any, Column, pl.Series, pd.Series]], required): Dictionary where keys are column names and values are Column expressions, Series, or literal values
Returns: DataFrame - New DataFrame with added/replaced columns
df.with_columns({
    "double_age": col("age") * 2,
    "constant": 1,
    "age_plus_10": col("age") + 10
})
All columns are created at once, so new columns cannot depend on each other. Use chained with_column() calls for dependent columns.
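The note above can be made concrete with a minimal pure-Python sketch, assuming a dict-of-columns model of a frame; `with_columns_sketch` is an illustrative helper, not this library's implementation:

```python
# Every expression is evaluated against a snapshot of the ORIGINAL frame,
# which is why a new column cannot reference another new column
# created in the same with_columns call.

def with_columns_sketch(frame, exprs):
    """frame: dict of column name -> list; exprs: dict of name -> fn(frame)."""
    snapshot = dict(frame)   # all expressions see only the original columns
    result = dict(frame)
    for name, fn in exprs.items():
        result[name] = fn(snapshot)
    return result

frame = {"age": [30, 40]}
out = with_columns_sketch(frame, {
    "double_age": lambda f: [a * 2 for a in f["age"]],
    # A lambda reading f["double_age"] here would fail with KeyError,
    # because "double_age" is not in the snapshot; chain with_column()
    # calls instead when one new column depends on another.
})
```

In this model the snapshot is what makes all columns independent; chaining `with_column()` re-snapshots after each new column, which is why chaining supports dependent columns.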

with_column_renamed

df.with_column_renamed(col_name: str, new_col_name: str) -> DataFrame
Rename a column. No-op if the column does not exist.
Parameters:
  • col_name (str, required): Name of the column to rename
  • new_col_name (str, required): New name for the column
Returns: DataFrame - New DataFrame with the column renamed
df.with_column_renamed("age", "age_in_years")

Aggregations and Grouping

group_by

df.group_by(*cols: ColumnOrName) -> GroupedData
Groups the DataFrame using the specified columns.
Parameters:
  • cols (ColumnOrName, required): Columns to group by. Can be column names as strings or Column expressions.
Returns: GroupedData - Object for performing aggregations on the grouped data
# Group by single column
df.group_by("department").agg(count("*"))

# Group by multiple columns
df.group_by("department", "location").agg({"salary": "avg"})

# Group by expression
df.group_by(lower(col("department"))).agg(count("*"))

agg

df.agg(*exprs: Union[Column, Dict[str, str]]) -> DataFrame
Aggregate on the entire DataFrame without groups.
Parameters:
  • exprs (Union[Column, Dict[str, str]], required): Aggregation expressions or dictionary of aggregations
Returns: DataFrame - Aggregation results
# Multiple aggregations
df.agg(
    count().alias("total_rows"),
    avg(col("salary")).alias("avg_salary")
)

# Dictionary style
df.agg({"salary": "avg", "age": "max"})

Joins and Unions

join

df.join(
    other: DataFrame,
    on: Optional[Union[str, List[str]]] = None,
    *,
    left_on: Optional[Union[ColumnOrName, List[ColumnOrName]]] = None,
    right_on: Optional[Union[ColumnOrName, List[ColumnOrName]]] = None,
    how: JoinType = "inner"
) -> DataFrame
Joins this DataFrame with another DataFrame.
Parameters:
  • other (DataFrame, required): DataFrame to join with
  • on (Union[str, List[str]], optional): Join condition(s). Can be column name(s), Column expression(s), or None for cross joins
  • left_on (Union[ColumnOrName, List[ColumnOrName]], optional): Column(s) from left DataFrame to join on
  • right_on (Union[ColumnOrName, List[ColumnOrName]], optional): Column(s) from right DataFrame to join on
  • how (JoinType, default "inner"): Type of join: "inner", "left", "right", "outer", "cross"
Returns: DataFrame - Joined DataFrame
df1.join(df2, on="id")
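To illustrate what an inner join on a key does to unmatched rows, here is a minimal pure-Python sketch over lists of row dicts; `inner_join` is an illustrative helper, not part of this library's API:

```python
# Sketch of inner-join semantics: match rows on a key column,
# merge matched pairs, and drop left rows with no match on the right.

def inner_join(left, right, on):
    """left/right: lists of row dicts; on: key column name."""
    index = {}
    for row in right:
        index.setdefault(row[on], []).append(row)
    joined = []
    for lrow in left:
        for rrow in index.get(lrow[on], []):  # unmatched left rows are dropped
            joined.append({**lrow, **rrow})
    return joined

left = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
right = [{"id": 1, "dept": "Eng"}]
rows = inner_join(left, right, "id")
# Only id 1 survives; Bob has no match on the right.
```

Under "left" semantics Bob would instead be kept with a null `dept`; "outer" keeps unmatched rows from both sides.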

union

df.union(other: DataFrame) -> DataFrame
Return a new DataFrame containing the union of rows in this and another DataFrame.
Parameters:
  • other (DataFrame, required): Another DataFrame with the same schema
Returns: DataFrame - A new DataFrame containing rows from both DataFrames
df1.union(df2)

# Remove duplicates after union
df1.union(df2).drop_duplicates()
This is equivalent to UNION ALL in SQL. To remove duplicates, use drop_duplicates() after union.
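The UNION ALL behavior described above can be sketched in plain Python over lists of row dicts; `union_all` and `dedup` are illustrative helpers, not this library's implementation:

```python
# UNION ALL keeps duplicate rows; deduplication is a separate step.

def union_all(rows1, rows2):
    return rows1 + rows2  # duplicates are kept, as in SQL UNION ALL

def dedup(rows):
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

a = [{"id": 1}, {"id": 2}]
b = [{"id": 2}, {"id": 3}]
combined = union_all(a, b)  # 4 rows; {"id": 2} appears twice
unique = dedup(combined)    # 3 rows after deduplication
```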

Sorting and Limiting

sort / order_by

df.sort(
    cols: Union[ColumnOrName, List[ColumnOrName], None] = None,
    ascending: Optional[Union[bool, List[bool]]] = None
) -> DataFrame
df.order_by(
    cols: Union[ColumnOrName, List[ColumnOrName], None] = None,
    ascending: Optional[Union[bool, List[bool]]] = None
) -> DataFrame
Sort the DataFrame by the specified columns.
Parameters:
  • cols (Union[ColumnOrName, List[ColumnOrName]], optional): Columns to sort by. Can include sorting directives like asc(), desc()
  • ascending (Union[bool, List[bool]], optional): Boolean or list of booleans indicating sort order
Returns: DataFrame - Sorted DataFrame
# Sort ascending
df.sort(col("age").asc())

# Sort descending
df.sort(col("age").desc())

# Boolean parameter
df.sort("age", ascending=False)

# Multiple columns
df.sort(col("age").desc(), col("name").asc())

limit

df.limit(n: int) -> DataFrame
Limits the number of rows to the specified number.
Parameters:
  • n (int, required): Maximum number of rows to return
Returns: DataFrame - DataFrame with at most n rows
df.limit(10)

Array and Struct Operations

explode

df.explode(column: ColumnOrName) -> DataFrame
Create a new row for each element in an array column.
Parameters:
  • column (ColumnOrName, required): Name of array column to explode or Column expression
Returns: DataFrame - New DataFrame with the array column exploded into multiple rows
df = session.create_dataframe({
    "id": [1, 2],
    "tags": [["red", "blue"], ["green"]]
})

df.explode("tags")
# id | tags
# ---|------
#  1 | red
#  1 | blue
#  2 | green
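The table above can be reproduced with a minimal pure-Python sketch over a dict-of-columns frame; `explode` here is an illustrative helper, not this library's implementation:

```python
# One output row per array element; the other columns are repeated
# for each element of the row's array.

def explode(frame, column):
    out = {name: [] for name in frame}
    for i in range(len(frame[column])):
        for element in frame[column][i]:      # one output row per element
            for name in frame:
                out[name].append(element if name == column else frame[name][i])
    return out

frame = {"id": [1, 2], "tags": [["red", "blue"], ["green"]]}
exploded = explode(frame, "tags")
# Matches the table above: ids [1, 1, 2], tags ["red", "blue", "green"]
```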

explode_with_index

df.explode_with_index(
    column: ColumnOrName,
    index_col_name: str = "pos",
    value_col_name: str = "col",
    keep_null_and_empty: bool = False
) -> DataFrame
Create a new row for each element in an array column, with the element’s position and value.
Parameters:
  • column (ColumnOrName, required): Name of array column to explode
  • index_col_name (str, default "pos"): Name for the position column
  • value_col_name (str, default "col"): Name for the value column
  • keep_null_and_empty (bool, default False): If True, preserves rows where the array is null or empty
Returns: DataFrame - New DataFrame with position and value columns
df.explode_with_index("tags")
# pos | id | tags
# ----|----|----- 
#   0 |  1 | red
#   1 |  1 | blue
#   0 |  2 | green
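A pure-Python sketch of the position tracking shown above; `explode_with_index` here is an illustrative helper that keeps the original column name to match the example output, and omits the `value_col_name` parameter for brevity:

```python
# Like explode, but also emits each element's index within its array.

def explode_with_index(frame, column, index_col="pos"):
    out = {index_col: []}
    out.update({name: [] for name in frame})
    for i in range(len(frame[column])):
        for pos, element in enumerate(frame[column][i]):
            out[index_col].append(pos)  # element's position within its array
            for name in frame:
                out[name].append(element if name == column else frame[name][i])
    return out

frame = {"id": [1, 2], "tags": [["red", "blue"], ["green"]]}
result = explode_with_index(frame, "tags")
# pos restarts at 0 for each input row, as in the table above.
```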

unnest

df.unnest(*col_names: str) -> DataFrame
Unnest the specified struct columns into separate columns.
Parameters:
  • col_names (str, required): One or more struct columns to unnest
Returns: DataFrame - A new DataFrame with struct columns expanded
df = session.create_dataframe({
    "id": [1, 2],
    "tags": [{"red": 1, "blue": 2}, {"red": 3}]
})

df.unnest("tags")
# id | red | blue
# ---|-----|-----
#  1 |   1 |    2
#  2 |   3 | null
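The null-filling behavior in the table above (struct fields missing from a row become null) can be sketched in plain Python over lists of row dicts; `unnest` here is an illustrative helper, not this library's implementation:

```python
# Expand a struct (dict) column into one column per field;
# fields absent from a given row's struct become None (null).

def unnest(rows, column):
    """rows: list of row dicts whose `column` holds a dict (struct)."""
    # Collect every field name that appears in any struct, preserving order.
    fields = []
    for row in rows:
        for key in (row[column] or {}):
            if key not in fields:
                fields.append(key)
    out = []
    for row in rows:
        struct = row[column] or {}
        new_row = {k: v for k, v in row.items() if k != column}
        for key in fields:
            new_row[key] = struct.get(key)  # missing fields become None
        out.append(new_row)
    return out

rows = [{"id": 1, "tags": {"red": 1, "blue": 2}},
        {"id": 2, "tags": {"red": 3}}]
flat = unnest(rows, "tags")
```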

Deduplication

distinct

df.distinct() -> DataFrame
Return a DataFrame with duplicate rows removed.
Returns: DataFrame - A new DataFrame with duplicate rows removed
df.distinct()

drop_duplicates

df.drop_duplicates(subset: Optional[List[str]] = None) -> DataFrame
Return a DataFrame with duplicate rows removed.
Parameters:
  • subset (List[str], optional): Column names to consider when identifying duplicates. If not provided, all columns are considered.
Returns: DataFrame - A new DataFrame with duplicate rows removed
# Remove duplicates considering all columns
df.drop_duplicates()

# Remove duplicates considering specific columns
df.drop_duplicates(["c1", "c2"])
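The effect of the `subset` parameter can be sketched in plain Python over lists of row dicts; `drop_duplicates` here is an illustrative helper, and keeping the first occurrence on ties is an assumption of this sketch, not something the reference above specifies:

```python
# With a subset, two rows count as duplicates when they agree on
# the subset columns, even if other columns differ.

def drop_duplicates(rows, subset=None):
    seen, out = set(), []
    for row in rows:
        keys = subset if subset is not None else sorted(row)
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            out.append(row)  # this sketch keeps the first occurrence
    return out

rows = [
    {"c1": 1, "c2": "a", "c3": 10},
    {"c1": 1, "c2": "a", "c3": 99},  # duplicate under subset ["c1", "c2"]
    {"c1": 2, "c2": "b", "c3": 10},
]
all_cols = drop_duplicates(rows)                 # 3 rows: no full-row dupes
by_subset = drop_duplicates(rows, ["c1", "c2"])  # 2 rows
```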

Caching

persist / cache

df.persist() -> DataFrame
df.cache() -> DataFrame
Mark this DataFrame to be persisted after first computation.
Returns: DataFrame - Same DataFrame, but marked for persistence
# Cache intermediate results
filtered_df = (
    df.filter(col("age") > 25)
    .persist()  # Cache these results
)

# Both operations use cached results
result1 = filtered_df.group_by("department").count()
result2 = filtered_df.select("name", "salary")

Actions (Trigger Execution)

show

df.show(n: int = 10, explain_analyze: bool = False) -> None
Display the DataFrame content in a tabular form.
Parameters:
  • n (int, default 10): Number of rows to display
  • explain_analyze (bool, default False): Whether to print the explain analyze plan
df.show()
df.show(20)
df.show(10, explain_analyze=True)

collect

df.collect(data_type: DataLikeType = "polars") -> QueryResult
Execute the DataFrame computation and return the result as a QueryResult.
Parameters:
  • data_type (DataLikeType, default "polars"): Type of data to return: "polars", "pandas", "arrow", "pydict", "pylist"
Returns: QueryResult - A QueryResult with materialized data and query metrics
result = df.collect()
data = result.data
metrics = result.metrics

to_polars / to_pandas / to_arrow

df.to_polars() -> pl.DataFrame
df.to_pandas() -> pd.DataFrame
df.to_arrow() -> pa.Table
Execute the DataFrame computation and return the results in the requested format.
Returns: Union[pl.DataFrame, pd.DataFrame, pa.Table] - Materialized results in the requested format
pl_df = df.to_polars()
pd_df = df.to_pandas()
arrow_table = df.to_arrow()

to_pydict / to_pylist

df.to_pydict() -> Dict[str, List[Any]]
df.to_pylist() -> List[Dict[str, Any]]
Execute the DataFrame computation and return Python data structures.
Returns: Union[Dict[str, List[Any]], List[Dict[str, Any]]] - Materialized results as a Python dictionary of column arrays or a list of row dictionaries
# Dictionary of column arrays
data_dict = df.to_pydict()
# {"col1": [1, 2, 3], "col2": ["a", "b", "c"]}

# List of row dictionaries
data_list = df.to_pylist()
# [{"col1": 1, "col2": "a"}, {"col1": 2, "col2": "b"}]
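The two layouts are interconvertible; a minimal pure-Python sketch, where `pydict_to_pylist` and `pylist_to_pydict` are illustrative helpers, not part of this library's API:

```python
# Column-oriented dict-of-lists <-> row-oriented list-of-dicts.

def pydict_to_pylist(columns):
    names = list(columns)
    length = len(columns[names[0]]) if names else 0
    return [{name: columns[name][i] for name in names} for i in range(length)]

def pylist_to_pydict(rows):
    names = list(rows[0]) if rows else []
    return {name: [row[name] for row in rows] for name in names}

data_dict = {"col1": [1, 2, 3], "col2": ["a", "b", "c"]}
data_list = pydict_to_pylist(data_dict)
round_trip = pylist_to_pydict(data_list)  # back to the original layout
```

The column layout suits vectorized processing; the row layout suits iteration and JSON serialization.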

count

df.count() -> int
Count the number of rows in the DataFrame.
Returns: int - The number of rows in the DataFrame
num_rows = df.count()

Lineage

lineage

df.lineage() -> Lineage
Create a Lineage object to trace data through transformations.
Returns: Lineage - Interface for querying data lineage
lineage = df.lineage()

# Trace backwards
source_rows = lineage.backward(["result_uuid1", "result_uuid2"])

# Trace forwards
result_rows = lineage.forward(["source_uuid1"])

Utility Methods

explain

df.explain() -> None
Display the logical plan of the DataFrame.
df.explain()

Aliases

  • withColumn → with_column
  • withColumns → with_columns
  • withColumnRenamed → with_column_renamed
  • groupBy / groupby → group_by
  • dropDuplicates → drop_duplicates
  • orderBy → order_by
  • unionAll → union
