Configuration Templates

Overview

The pipeline ships with two configuration templates, one tuned for edge devices and one for servers. Each template is a JSON file that you load with the --config-template CLI argument.

Edge Template

Optimized for edge devices with limited memory and compute. Location: configs/pipeline.edge.template.json

{
  "random_seed": 42,
  "chunk_size": 64,
  "batch_size": 96,
  "n_jobs": 1,
  "max_memory_mb": 256,
  "max_compute_units": 0.4,
  "benchmark_runs": 3,
  "adaptive_chunk_resize": true,
  "max_chunk_retries": 3,
  "spill_to_disk": true,
  "output_dir": "artifacts_edge"
}

Edge Template Characteristics

Parameter	Type	Default	Description
chunk_size	int	64	Small chunks minimize memory footprint for edge devices
batch_size	int	96	Conservative batch size to avoid memory exhaustion
n_jobs	int	1	Single-threaded execution to reduce overhead on limited cores
max_memory_mb	int	256	Strict 256 MB memory limit for edge deployment
max_compute_units	float	0.4	Throttled to 40% to leave resources for other processes
benchmark_runs	int	3	Fewer benchmark iterations to reduce processing time
spill_to_disk	bool	true	Enabled: critical for handling datasets larger than available RAM

Use Cases

Raspberry Pi or similar single-board computers
IoT devices with limited resources
Mobile or embedded systems
Environments where memory is <512MB

Server Template

Optimized for high-performance server environments with ample resources. Location: configs/pipeline.server.template.json

{
  "random_seed": 42,
  "chunk_size": 256,
  "batch_size": 512,
  "n_jobs": 4,
  "max_memory_mb": 4096,
  "max_compute_units": 1.0,
  "benchmark_runs": 5,
  "adaptive_chunk_resize": true,
  "max_chunk_retries": 3,
  "spill_to_disk": false,
  "output_dir": "artifacts_server"
}

Server Template Characteristics

Parameter	Type	Default	Description
chunk_size	int	256	Large chunks maximize throughput on powerful hardware
batch_size	int	512	Large batches leverage vectorization for faster processing
n_jobs	int	4	Multi-threaded execution for parallel processing
max_memory_mb	int	4096	Generous 4 GB memory allocation for complex operations
max_compute_units	float	1.0	Full compute resources available (100%)
benchmark_runs	int	5	More iterations for statistically robust benchmarks
spill_to_disk	bool	false	Disabled: keeps all data in memory for maximum performance

Use Cases

Cloud compute instances (AWS, GCP, Azure)
On-premise data processing servers
Development workstations
Environments with >4GB RAM

Using Templates

Load Template via CLI

python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --config-template configs/pipeline.edge.template.json

Override Template Values

CLI arguments take precedence over template values:

python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --config-template configs/pipeline.edge.template.json \
  --chunk-size 128 \
  --n-jobs 2

This loads the edge template but overrides chunk_size to 128 and n_jobs to 2.
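
The precedence rule can be illustrated with a minimal sketch. It assumes the CLI is built on argparse and that unset flags default to None; build_parser and resolve_settings are hypothetical names used only for illustration, not the pipeline's actual implementation.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical subset of the CLI. Defaults are None so that
    # "flag not given" can be told apart from an explicit value.
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-template", type=Path, default=None)
    parser.add_argument("--chunk-size", type=int, default=None)
    parser.add_argument("--n-jobs", type=int, default=None)
    return parser


def resolve_settings(template_values: dict, args: argparse.Namespace) -> dict:
    """Layer explicitly passed CLI flags on top of the template values."""
    overrides = {
        key: value
        for key, value in vars(args).items()
        if key != "config_template" and value is not None
    }
    # Later keys win, so CLI overrides replace template values.
    return {**template_values, **overrides}


# A few values from the edge template, overridden as in the command above.
edge_values = {"chunk_size": 64, "batch_size": 96, "n_jobs": 1}
args = build_parser().parse_args(["--chunk-size", "128", "--n-jobs", "2"])
settings = resolve_settings(edge_values, args)
```

Here batch_size keeps its template value because no matching flag was passed, while chunk_size and n_jobs take the CLI values.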

Load Template in Python

import json
from pathlib import Path
from pipeline.config import PipelineConfig

# Load template
template_path = Path('configs/pipeline.server.template.json')
template_values = json.loads(template_path.read_text(encoding='utf-8'))

# Convert output_dir string to Path
template_values['output_dir'] = Path(template_values['output_dir'])

# Create config from template
config = PipelineConfig(**template_values)
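
If PipelineConfig is a dataclass (an assumption; its definition is not shown here), passing an unrecognized key via ** raises a TypeError. The sketch below wraps the loading steps in a reusable function and reports unknown keys up front. DemoConfig is a hypothetical stand-in with only three fields, and load_config is an illustrative name.

```python
import json
import tempfile
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class DemoConfig:
    """Hypothetical stand-in for PipelineConfig with a few of its fields."""
    chunk_size: int
    n_jobs: int
    output_dir: Path


def load_config(template_path: Path) -> DemoConfig:
    """Read a JSON template and build a config, rejecting unknown keys."""
    values = json.loads(template_path.read_text(encoding="utf-8"))
    known = {f.name for f in fields(DemoConfig)}
    unknown = sorted(values.keys() - known)
    if unknown:
        raise ValueError(f"unrecognized template keys: {unknown}")
    # JSON has no path type, so convert the string explicitly.
    values["output_dir"] = Path(values["output_dir"])
    return DemoConfig(**values)


# Write a small template to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.template.json"
    demo_values = {"chunk_size": 256, "n_jobs": 4, "output_dir": "artifacts_server"}
    path.write_text(json.dumps(demo_values), encoding="utf-8")
    config = load_config(path)
```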

Creating Custom Templates

You can create your own templates for specific environments:

{
  "random_seed": 42,
  "chunk_size": 192,
  "batch_size": 384,
  "n_jobs": 2,
  "max_memory_mb": 2048,
  "max_compute_units": 0.75,
  "benchmark_runs": 5,
  "adaptive_chunk_resize": true,
  "max_chunk_retries": 3,
  "spill_to_disk": false,
  "output_dir": "artifacts_custom"
}

Save the file as JSON and load it with --config-template path/to/your/template.json.
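
Before loading a custom template, it can help to check that it has the same keys and value types as the shipped templates. The following is a minimal sketch: the expected types are inferred from the edge and server templates above, and check_template is a hypothetical helper, not part of the pipeline.

```python
import json

# Expected JSON types per key, inferred from the shipped templates (an assumption).
EXPECTED_TYPES = {
    "random_seed": int,
    "chunk_size": int,
    "batch_size": int,
    "n_jobs": int,
    "max_memory_mb": int,
    "max_compute_units": float,
    "benchmark_runs": int,
    "adaptive_chunk_resize": bool,
    "max_chunk_retries": int,
    "spill_to_disk": bool,
    "output_dir": str,
}


def check_template(values: dict) -> list[str]:
    """Return a list of problems; an empty list means the template looks valid."""
    problems = []
    for key in sorted(EXPECTED_TYPES.keys() - values.keys()):
        problems.append(f"missing key: {key}")
    for key in sorted(values.keys() - EXPECTED_TYPES.keys()):
        problems.append(f"unknown key: {key}")
    for key, expected in EXPECTED_TYPES.items():
        if key not in values:
            continue
        value = values[key]
        if expected in (int, float) and isinstance(value, bool):
            # bool is a subclass of int in Python, so reject it for numeric fields.
            ok = False
        elif expected is float:
            # Accept whole numbers such as 1 where a float is expected.
            ok = isinstance(value, (int, float))
        else:
            ok = isinstance(value, expected)
        if not ok:
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


custom = json.loads("""
{
  "random_seed": 42, "chunk_size": 192, "batch_size": 384, "n_jobs": 2,
  "max_memory_mb": 2048, "max_compute_units": 0.75, "benchmark_runs": 5,
  "adaptive_chunk_resize": true, "max_chunk_retries": 3,
  "spill_to_disk": false, "output_dir": "artifacts_custom"
}
""")
problems = check_template(custom)
```

For the custom template shown above, problems is empty; a misspelled key or a quoted number would show up as an entry in the list.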

Template Selection Guide

Criteria	Edge Template	Server Template
Available RAM	<512MB	>4GB
CPU Cores	1-2	4+
Dataset Size	<100MB	Any size
Priority	Resource efficiency	Maximum performance
Disk Spilling	Enabled	Disabled
Processing Time	Slower, conservative	Faster, aggressive
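
The RAM and CPU rows of the guide can be turned into a small helper. This is a sketch only: suggest_template is a hypothetical function, the thresholds are taken from the table above, and the caller supplies the RAM figure because the Python standard library has no portable way to query it.

```python
from typing import Optional

EDGE_TEMPLATE = "configs/pipeline.edge.template.json"
SERVER_TEMPLATE = "configs/pipeline.server.template.json"


def suggest_template(available_ram_mb: int, cpu_cores: int) -> Optional[str]:
    """Suggest a shipped template, or None when the guide does not clearly apply."""
    if available_ram_mb < 512 and cpu_cores <= 2:
        return EDGE_TEMPLATE
    if available_ram_mb > 4096 and cpu_cores >= 4:
        return SERVER_TEMPLATE
    # In-between hardware: a custom template is probably a better fit.
    return None


small_board = suggest_template(available_ram_mb=256, cpu_cores=1)
cloud_box = suggest_template(available_ram_mb=8192, cpu_cores=8)
mid_range = suggest_template(available_ram_mb=2048, cpu_cores=2)
```

Returning None for hardware that falls between the two columns is a deliberate choice here, since the guide gives no rule for that range.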

Next Steps

Configuration Overview

CLI Reference
