## Overview

The data transform tool provides a pipeline-based approach to transforming structured data. It supports CSV, JSON, and YAML formats and applies operations sequentially: filter, select, sort, and aggregate.

## Tool

### data_transform

Transform structured data files using a pipeline of operations.

Parameters:
- Input file path: the data file to read (CSV, JSON, or YAML)
- `output_file`: path for the output file; the extension determines the format. If omitted, the transformed data is returned as JSON.
- Pipeline: operations applied sequentially; each operation is an object with a `type` field.

## Operations
### filter

Filter rows based on a column value and comparison operator. Filter specification:
- `column`: column name to filter on
- `op`: comparison operator
- `value`: value to compare against

Supported operators:
- `==`: equals (string comparison)
- `!=`: not equals
- `>`: greater than (numeric)
- `<`: less than (numeric)
- `>=`: greater than or equal (numeric)
- `<=`: less than or equal (numeric)
- `contains`: case-insensitive substring match
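As a rough illustration of these semantics (a sketch, not the tool's actual implementation; `apply_filter` is a hypothetical helper), a filter spec could be applied to a list of row dicts like this:

```python
def apply_filter(rows, spec):
    """Illustrative sketch of the filter operation (hypothetical helper)."""
    column, op, value = spec["column"], spec["op"], spec["value"]
    ops = {
        "==": lambda a, b: str(a) == str(b),        # string comparison
        "!=": lambda a, b: str(a) != str(b),
        ">":  lambda a, b: float(a) > float(b),     # numeric comparisons
        "<":  lambda a, b: float(a) < float(b),
        ">=": lambda a, b: float(a) >= float(b),
        "<=": lambda a, b: float(a) <= float(b),
        "contains": lambda a, b: str(b).lower() in str(a).lower(),
    }
    # Rows missing the column are dropped rather than raising an error.
    return [r for r in rows if column in r and ops[op](r[column], value)]

rows = [{"name": "Alice", "age": "34"}, {"name": "Bob", "age": "19"}]
apply_filter(rows, {"column": "age", "op": ">=", "value": 21})
# -> [{"name": "Alice", "age": "34"}]
```

Note that numeric operators coerce both sides to `float`, so string-typed values from CSV still compare numerically.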
### select

Select only specified columns from each row. Select specification:
- `columns`: list of column names to keep
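In the same illustrative style (hypothetical helper, not the tool's code), select reduces each row to the requested columns:

```python
def apply_select(rows, spec):
    # Keep only the listed columns; missing columns are skipped, not errors.
    return [{c: r[c] for c in spec["columns"] if c in r} for r in rows]

rows = [{"id": 1, "name": "Alice", "email": "a@example.com"}]
apply_select(rows, {"columns": ["id", "name"]})
# -> [{"id": 1, "name": "Alice"}]
```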
### sort

Sort data by a column in ascending or descending order. Sort specification:
- `by`: column name to sort by
- `reverse`: boolean; `true` for descending order (default: `false`)
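A minimal sketch of the sort semantics (hypothetical helper):

```python
def apply_sort(rows, spec):
    # reverse defaults to false (ascending), matching the spec above.
    return sorted(rows, key=lambda r: r[spec["by"]],
                  reverse=spec.get("reverse", False))

rows = [{"n": 2}, {"n": 1}, {"n": 3}]
apply_sort(rows, {"by": "n", "reverse": True})
# -> [{"n": 3}, {"n": 2}, {"n": 1}]
```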
### aggregate

Group by a column and compute aggregate statistics. Aggregate specification:
- `group_by`: column to group by
- `agg`: aggregation function (`count`, `sum`, `avg`, `min`, `max`)
- `value_column`: column to aggregate (required for `sum`/`avg`/`min`/`max`)

Output row shapes:
- Count: `{group_by_column: value, count: N}`
- Sum/Avg/Min/Max: `{group_by_column: value, agg_value_column: result}`
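The grouping and output shapes above can be sketched as follows (illustrative only; `apply_aggregate` is a hypothetical helper):

```python
from collections import defaultdict

def apply_aggregate(rows, spec):
    """Illustrative sketch of the aggregate operation described above."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[spec["group_by"]]].append(row)
    out = []
    for key, members in groups.items():
        if spec["agg"] == "count":
            out.append({spec["group_by"]: key, "count": len(members)})
        else:
            values = [float(m[spec["value_column"]]) for m in members]
            result = {
                "sum": sum(values),
                "avg": sum(values) / len(values),
                "min": min(values),
                "max": max(values),
            }[spec["agg"]]
            # Result key follows the agg_value_column naming shown above.
            out.append({spec["group_by"]: key,
                        f"{spec['agg']}_{spec['value_column']}": result})
    return out

rows = [{"dept": "eng", "salary": 100}, {"dept": "eng", "salary": 90},
        {"dept": "ops", "salary": 80}]
apply_aggregate(rows, {"group_by": "dept", "agg": "avg", "value_column": "salary"})
# -> [{"dept": "eng", "avg_salary": 95.0}, {"dept": "ops", "avg_salary": 80.0}]
```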
## Pipeline Example
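As an illustration (the exact call shape depends on how the tool is invoked; the data values are made up, while the field names follow the operation specs above), a pipeline might be expressed as an ordered list of operation objects:

```python
import json

# Hypothetical pipeline: keep active users, sort by age descending,
# then keep only two columns. Each operation carries a "type" field.
pipeline = [
    {"type": "filter", "column": "status", "op": "==", "value": "active"},
    {"type": "sort", "by": "age", "reverse": True},
    {"type": "select", "columns": ["name", "age"]},
]
print(json.dumps(pipeline, indent=2))
```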
Multiple operations can be combined for complex transformations; each operation receives the output of the previous one.

## Input Formats
### CSV

Reads CSV files using Python's `csv.DictReader`. The first row is treated as headers.
### JSON

Supports three JSON structures:
- Array of objects: `[{"id": 1, "name": "Alice"}, ...]`
- Single object: `{"id": 1, "name": "Alice"}` (wrapped in an array)
- Primitive value: `42` (wrapped as `[{"value": 42}]`)
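The wrapping rules above can be sketched as (illustrative, not the tool's actual code):

```python
def normalize_json(data):
    if isinstance(data, list):
        return data               # array of objects: used as-is
    if isinstance(data, dict):
        return [data]             # single object: wrapped in an array
    return [{"value": data}]      # primitive: wrapped as [{"value": ...}]

normalize_json(42)
# -> [{"value": 42}]
```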
### YAML

Requires the PyYAML package; supports the same structures as JSON. Install YAML support with `pip install pyyaml`.
## Output Formats

Output format is determined by the file extension:
- `.csv`: CSV with headers
- `.json`: JSON with 2-space indentation
- `.yaml` or `.yml`: YAML with default flow style
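Extension-based output selection might look roughly like this sketch (the `serialize` helper and its structure are assumptions, not the tool's code):

```python
import csv
import io
import json

def serialize(rows, extension):
    """Sketch: pick a serializer based on the output file extension."""
    if extension == ".csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
        writer.writeheader()                 # CSV output includes headers
        writer.writerows(rows)
        return buf.getvalue()
    if extension == ".json":
        return json.dumps(rows, indent=2)    # 2-space indentation
    if extension in (".yaml", ".yml"):
        import yaml                          # optional PyYAML dependency
        return yaml.dump(rows)               # default flow style
    raise ValueError(f"unsupported extension: {extension}")

serialize([{"a": 1}], ".csv")
# -> 'a\r\n1\r\n'
```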
## Return Values

- With `output_file`: the transformed data is written to that path, in the format implied by its extension.
- Without `output_file`: the transformed data is returned as JSON.

## Use Cases
- Data cleaning: Filter out invalid or incomplete records
- Report generation: Select and sort specific columns for reporting
- Format conversion: Convert between CSV, JSON, and YAML
- Data aggregation: Group and summarize large datasets
- ETL pipelines: Transform data before loading into databases
## Implementation

Defined in `grip/tools/data_transform.py` (at `data_transform.py:183`). Uses:
- Python's built-in `csv` and `json` modules
- Optional `PyYAML` for YAML support
- Sequential pipeline processing (operations applied in order)
- Type coercion for numeric comparisons in filters
- Graceful handling of missing columns and invalid values
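The numeric coercion and graceful-failure behavior mentioned above could be sketched like this (hypothetical helper, not the actual implementation):

```python
def numeric_match(a, b, op):
    """Coerce both operands to float for a numeric comparison; values that
    cannot be coerced are treated as non-matching rather than raising."""
    try:
        x, y = float(a), float(b)
    except (TypeError, ValueError):
        return False
    return {"<": x < y, ">": x > y, "<=": x <= y, ">=": x >= y}[op]

numeric_match("34", 21, ">=")   # -> True
numeric_match("n/a", 21, ">=")  # -> False (coercion failed, row dropped)
```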