
Overview

The data transform tool provides a pipeline-based approach to transforming structured data. It supports CSV, JSON, and YAML formats, and applies operations sequentially: filter, select, sort, and aggregate.

Tool

data_transform

Transform structured data files using a pipeline of operations.
Parameters:
  • input_file (string, required): Path to the input data file (CSV, JSON, or YAML).
  • output_file (string, optional): Path for the output file; the extension determines the format. If omitted, the transformed data is returned as JSON.
  • operations (array, optional): Pipeline of operations applied sequentially. Each operation is an object with a type field.
Example:
result = await data_transform(
    input_file="sales_data.csv",
    output_file="filtered_sales.json",
    operations=[
        {
            "type": "filter",
            "filter": {"column": "revenue", "op": ">", "value": 1000}
        },
        {
            "type": "sort",
            "sort": {"by": "revenue", "reverse": True}
        }
    ]
)

Operations

filter

Filter rows based on column value and comparison operator. Filter specification:
  • column: Column name to filter on
  • op: Comparison operator
  • value: Value to compare against
Supported operators:
  • ==: Equals (string comparison)
  • !=: Not equals
  • >: Greater than (numeric)
  • <: Less than (numeric)
  • >=: Greater than or equal (numeric)
  • <=: Less than or equal (numeric)
  • contains: Case-insensitive substring match
Example:
operations=[
    {
        "type": "filter",
        "filter": {"column": "status", "op": "==", "value": "active"}
    }
]
Numeric filtering:
operations=[
    {
        "type": "filter",
        "filter": {"column": "price", "op": ">=", "value": 50}
    }
]
Substring search:
operations=[
    {
        "type": "filter",
        "filter": {"column": "description", "op": "contains", "value": "premium"}
    }
]
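The filter semantics above (string comparison for ==/!=, numeric coercion for the ordering operators, case-insensitive contains) can be sketched as follows. This is a minimal illustration of the documented behavior, not the tool's actual source; dropping rows whose values fail numeric coercion is an assumption consistent with the "graceful handling of invalid values" noted under Implementation.

```python
def apply_filter(rows, spec):
    """Sketch of the filter operation over a list of row dicts."""
    column, op, value = spec["column"], spec["op"], spec["value"]

    def keep(row):
        cell = row.get(column)
        if cell is None:
            return False
        # == and != are string comparisons; contains is case-insensitive.
        if op == "==":
            return str(cell) == str(value)
        if op == "!=":
            return str(cell) != str(value)
        if op == "contains":
            return str(value).lower() in str(cell).lower()
        # Ordering operators coerce both sides to float.
        try:
            a, b = float(cell), float(value)
        except (TypeError, ValueError):
            return False  # assumed: non-numeric cells are filtered out
        return {">": a > b, "<": a < b, ">=": a >= b, "<=": a <= b}[op]

    return [row for row in rows if keep(row)]
```

Note that CSV input yields string cells, which is why the ordering operators must coerce rather than compare lexicographically ("9" > "10" as strings, but 9 < 10 as numbers).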

select

Select only specified columns from each row. Select specification:
  • columns: List of column names to keep
Example:
operations=[
    {
        "type": "select",
        "columns": ["id", "name", "email"]
    }
]

sort

Sort data by a column in ascending or descending order. Sort specification:
  • by: Column name to sort by
  • reverse: Boolean, true for descending order (default: false)
Example (ascending):
operations=[
    {
        "type": "sort",
        "sort": {"by": "created_at", "reverse": False}
    }
]
Example (descending):
operations=[
    {
        "type": "sort",
        "sort": {"by": "revenue", "reverse": True}
    }
]
Auto-detection: the sort attempts numeric ordering first and falls back to case-insensitive string ordering if any value is not numeric.
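The numeric-first fallback behaves roughly like this sketch (an illustration, not the tool's actual source):

```python
def sort_rows(rows, spec):
    """Sort row dicts numerically if possible, else as lowercase strings."""
    by = spec["by"]
    reverse = spec.get("reverse", False)
    try:
        # Numeric attempt: every value must coerce to float.
        return sorted(rows, key=lambda r: float(r.get(by)), reverse=reverse)
    except (TypeError, ValueError):
        # Fallback: case-insensitive string comparison.
        return sorted(rows, key=lambda r: str(r.get(by, "")).lower(), reverse=reverse)
```

This matters for CSV input, where "10" would otherwise sort before "9" lexicographically.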

aggregate

Group by a column and compute aggregate statistics. Aggregate specification:
  • group_by: Column to group by
  • agg: Aggregation function (count, sum, avg, min, max)
  • value_column: Column to aggregate (required for sum/avg/min/max)
Example (count):
operations=[
    {
        "type": "aggregate",
        "aggregate": {"group_by": "category", "agg": "count"}
    }
]
Example (sum):
operations=[
    {
        "type": "aggregate",
        "aggregate": {
            "group_by": "region",
            "agg": "sum",
            "value_column": "sales"
        }
    }
]
Example (average):
operations=[
    {
        "type": "aggregate",
        "aggregate": {
            "group_by": "department",
            "agg": "avg",
            "value_column": "salary"
        }
    }
]
Output format:
  • Count: {group_by_column: value, count: N}
  • Sum/Avg/Min/Max: {group_by_column: value, agg_value_column: result}
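Grouping and the documented output keys (count, or agg_value_column such as sum_sales) can be sketched as follows. This is an assumed reading of the output format above, not the tool's actual source.

```python
from collections import defaultdict

def aggregate_rows(rows, spec):
    """Group row dicts by a column and compute one aggregate per group."""
    group_by = spec["group_by"]
    agg = spec["agg"]
    value_column = spec.get("value_column")  # required for sum/avg/min/max

    groups = defaultdict(list)
    for row in rows:
        groups[row.get(group_by)].append(row)

    out = []
    for key, members in groups.items():
        if agg == "count":
            out.append({group_by: key, "count": len(members)})
            continue
        values = [float(r[value_column]) for r in members]
        result = {
            "sum": sum(values),
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }[agg]
        # Output key follows the documented "agg_value_column" pattern.
        out.append({group_by: key, f"{agg}_{value_column}": result})
    return out
```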

Pipeline Example

Combine multiple operations for complex transformations:
result = await data_transform(
    input_file="transactions.csv",
    output_file="summary.json",
    operations=[
        # Keep only completed transactions
        {
            "type": "filter",
            "filter": {"column": "status", "op": "==", "value": "completed"}
        },
        # Keep only high-value transactions
        {
            "type": "filter",
            "filter": {"column": "amount", "op": ">", "value": 100}
        },
        # Select relevant columns
        {
            "type": "select",
            "columns": ["customer_id", "amount", "region"]
        },
        # Group by region and sum amounts
        {
            "type": "aggregate",
            "aggregate": {
                "group_by": "region",
                "agg": "sum",
                "value_column": "amount"
            }
        },
        # Sort by total amount descending
        {
            "type": "sort",
            "sort": {"by": "sum_amount", "reverse": True}
        }
    ]
)

Input Formats

CSV

Reads CSV files using Python’s csv.DictReader. The first row is treated as the header row.

JSON

Supports three JSON structures:
  • Array of objects: [{"id": 1, "name": "Alice"}, ...]
  • Single object: {"id": 1, "name": "Alice"} (wrapped in array)
  • Primitive value: 42 (wrapped as [{"value": 42}])
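Normalizing the three JSON shapes into a uniform list of row dicts might look like this sketch (assumed behavior based on the list above):

```python
import json

def load_json_rows(text):
    """Normalize JSON input to a list of records."""
    data = json.loads(text)
    if isinstance(data, list):
        return data          # array of objects: used as-is
    if isinstance(data, dict):
        return [data]        # single object: wrapped in an array
    return [{"value": data}] # primitive: wrapped as a one-column record
```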

YAML

Requires the PyYAML package. Supports the same structures as JSON. Install YAML support with:
pip install pyyaml

Output Formats

Output format is determined by file extension:
  • .csv: CSV with headers
  • .json: JSON with 2-space indentation
  • .yaml or .yml: YAML with default flow style
Without output_file: Returns transformed data as JSON preview (first 20 records).

Return Values

With output_file:
Wrote 42 records to filtered_sales.json

Preview (first 5 rows):
[
  {"id": 1, "name": "Product A", "revenue": 1500},
  {"id": 3, "name": "Product C", "revenue": 1200},
  ...
]
Without output_file:
Transformed 42 records:
[
  {"id": 1, "name": "Product A"},
  ...
]

Use Cases

  1. Data cleaning: Filter out invalid or incomplete records
  2. Report generation: Select and sort specific columns for reporting
  3. Format conversion: Convert between CSV, JSON, and YAML
  4. Data aggregation: Group and summarize large datasets
  5. ETL pipelines: Transform data before loading into databases

Implementation

Defined in grip/tools/data_transform.py (line 183). Uses:
  • Python’s built-in csv and json modules
  • Optional PyYAML for YAML support
  • Sequential pipeline processing (operations applied in order)
  • Type coercion for numeric comparisons in filters
  • Graceful handling of missing columns and invalid values
