The GroupedData class provides methods for performing aggregations on grouped DataFrames. It is returned by DataFrame.group_by() and should not be instantiated directly.

Creating GroupedData

GroupedData is returned by the group_by() method on DataFrames:
from fenic.api.functions import col, lower

# Create grouped data
grouped = df.group_by("department")

# Group by multiple columns
grouped = df.group_by("department", "location")

# Group by expressions
grouped = df.group_by(lower(col("department")))

Aggregation Methods

agg

grouped.agg(*exprs: Union[Column, Dict[str, str]]) -> DataFrame

Compute aggregations on grouped data and return the result as a DataFrame.

Parameters:
  exprs: Union[Column, Dict[str, str]] (required)
    Aggregation expressions. Can be:
      • Column expressions with aggregate functions (e.g., count("*"), sum("amount"))
      • A dictionary mapping column names to aggregate function names (e.g., {"amount": "sum", "age": "avg"})

Returns:
  DataFrame: A new DataFrame with one row per group and columns for the group keys and aggregated values.
from fenic.api.functions import count

# Count employees by department
df.group_by("department").agg(
    count("*").alias("employee_count")
)

Available Aggregate Functions

The following aggregate functions can be used with agg():

Count Functions

  • count(col) - Count non-null values
  • count("*") - Count all rows including nulls
  • count_distinct(col) - Count distinct values

Numeric Aggregations

  • sum(col) - Sum of values
  • avg(col) - Average of values
  • mean(col) - Alias for avg
  • min(col) - Minimum value
  • max(col) - Maximum value
  • stddev(col) - Standard deviation
  • variance(col) - Variance

Collection Functions

  • collect_list(col) - Collect values into a list
  • collect_set(col) - Collect unique values into a set

Statistical Functions

  • first(col) - First value in the group
  • last(col) - Last value in the group
  • approx_count_distinct(col) - Approximate count of distinct values
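
The semantics of several of these aggregates can be sketched in plain Python. This is illustrative only: the real functions execute as Column expressions inside the query engine, not over Python lists.

```python
# Conceptual sketch of aggregate semantics over one group's values.
# None stands in for NULL.
values = [10, 20, 20, None, 40]
non_null = [v for v in values if v is not None]

count_col = len(non_null)            # count(col): counts non-null values only
count_star = len(values)             # count("*"): counts every row, NULLs included
count_distinct = len(set(non_null))  # count_distinct(col)
total = sum(non_null)                # sum(col)
average = total / len(non_null)      # avg(col) / mean(col)
as_list = list(non_null)             # collect_list(col)
as_set = set(non_null)               # collect_set(col)
```

Note that every aggregate except count("*") skips the None entry, which mirrors the NULL-handling rule described in the Notes section below.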

Dictionary Aggregation Syntax

When using dictionary syntax, the following function names are supported:
  • "sum" - Sum of values
  • "avg" or "mean" - Average
  • "min" - Minimum
  • "max" - Maximum
  • "count" - Count non-null
  • "stddev" - Standard deviation
  • "variance" - Variance
  • "first" - First value
  • "last" - Last value
df.group_by("category").agg({
    "amount": "sum",
    "price": "avg",
    "quantity": "max",
    "id": "count"
})

Usage Patterns

Single Group Column

# Group by department and count
result = df.group_by("department").agg(
    count("*").alias("total")
)

# Output:
# department | total
# -----------|------
# IT         |    10
# HR         |     5
# Sales      |     8

Multiple Group Columns

# Group by department and location
result = df.group_by("department", "location").agg(
    avg("salary").alias("avg_salary"),
    count("*").alias("count")
)

# Output:
# department | location | avg_salary | count
# -----------|----------|------------|------
# IT         | NYC      |   85000.0  |    5
# IT         | LA       |   80000.0  |    5
# HR         | NYC      |   70000.0  |    3

Complex Aggregations

from fenic.api.functions import sum, avg, min, max, count, col

# Multiple metrics per group
result = df.group_by("department").agg(
    count("*").alias("employee_count"),
    sum("salary").alias("total_salary"),
    avg("salary").alias("avg_salary"),
    min("salary").alias("min_salary"),
    max("salary").alias("max_salary"),
    sum(col("salary") * col("bonus_pct")).alias("total_bonus")
)

Filtering After Aggregation

# Group and filter
result = (
    df.group_by("department")
    .agg(count("*").alias("employee_count"))
    .filter(col("employee_count") > 5)
)

Sorting Aggregated Results

# Group, aggregate, and sort
result = (
    df.group_by("department")
    .agg(
        avg("salary").alias("avg_salary"),
        count("*").alias("count")
    )
    .sort(col("avg_salary").desc())
)

Common Patterns

Calculate Percentages

# Count by category and calculate percentage
total = df.count()
result = (
    df.group_by("category")
    .agg(count("*").alias("count"))
    .with_column(
        "percentage",
        (col("count") / total) * 100
    )
)

Top N per Group

from fenic.api.window import Window
from fenic.api.functions import row_number

# Get top 3 salaries per department
window = Window.partition_by("department").order_by(col("salary").desc())

result = (
    df.with_column("rank", row_number().over(window))
    .filter(col("rank") <= 3)
)

Multiple Aggregation Levels

# Department-level aggregation
dept_stats = df.group_by("department").agg(
    avg("salary").alias("dept_avg_salary")
)

# Location-level aggregation
loc_stats = df.group_by("location").agg(
    avg("salary").alias("loc_avg_salary")
)

# Combine both
result = (
    df.join(dept_stats, on="department")
    .join(loc_stats, on="location")
)

Error Handling

Common Errors:
  • Using non-aggregate functions in agg() will raise a ValidationError
  • Dictionary values must be valid aggregate function names
  • Grouping by literal values raises ValueError
# ERROR: Cannot use non-aggregate function
df.group_by("department").agg(
    col("name")  # ValidationError!
)

# CORRECT: Use aggregate function
df.group_by("department").agg(
    collect_list("name").alias("names")
)

Notes

  • The order of rows in the output is not guaranteed unless you explicitly sort
  • NULL values are treated as a separate group
  • Empty groups are not included in the result
  • All aggregate functions ignore NULL values except count("*")
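
The NULL-as-a-separate-group behavior can also be sketched in plain Python (illustrative only; actual grouping happens inside the engine):

```python
# Conceptual sketch: rows whose group key is NULL (None) form their own group.
rows = [
    {"dept": "IT", "salary": 100},
    {"dept": "IT", "salary": 200},
    {"dept": None, "salary": 50},
]

groups = {}
for row in rows:
    groups.setdefault(row["dept"], []).append(row["salary"])

# groups now has two keys: "IT" and None.
```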
