
Overview

The pipeline includes two optimized configuration templates that represent common deployment scenarios. Templates are JSON files that can be loaded via the --config-template CLI argument.

Edge Template

Optimized for edge devices with limited memory and compute.

Location: configs/pipeline.edge.template.json
{
  "random_seed": 42,
  "chunk_size": 64,
  "batch_size": 96,
  "n_jobs": 1,
  "max_memory_mb": 256,
  "max_compute_units": 0.4,
  "benchmark_runs": 3,
  "adaptive_chunk_resize": true,
  "max_chunk_retries": 3,
  "spill_to_disk": true,
  "output_dir": "artifacts_edge"
}

Edge Template Characteristics

  • chunk_size (int, default: 64): Small chunks minimize the memory footprint on edge devices
  • batch_size (int, default: 96): Conservative batch size to avoid memory exhaustion
  • n_jobs (int, default: 1): Single-threaded execution to reduce overhead on limited cores
  • max_memory_mb (int, default: 256): Strict 256 MB memory limit for edge deployment
  • max_compute_units (float, default: 0.4): Throttled to 40% to leave resources for other processes
  • benchmark_runs (int, default: 3): Fewer benchmark iterations to reduce processing time
  • spill_to_disk (bool, default: true): Enabled; critical for handling datasets larger than available RAM

Use Cases

  • Raspberry Pi or similar single-board computers
  • IoT devices with limited resources
  • Mobile or embedded systems
  • Environments where memory is <512MB

Server Template

Optimized for high-performance server environments with ample resources.

Location: configs/pipeline.server.template.json
{
  "random_seed": 42,
  "chunk_size": 256,
  "batch_size": 512,
  "n_jobs": 4,
  "max_memory_mb": 4096,
  "max_compute_units": 1.0,
  "benchmark_runs": 5,
  "adaptive_chunk_resize": true,
  "max_chunk_retries": 3,
  "spill_to_disk": false,
  "output_dir": "artifacts_server"
}

Server Template Characteristics

  • chunk_size (int, default: 256): Large chunks maximize throughput on powerful hardware
  • batch_size (int, default: 512): Large batches leverage vectorization for faster processing
  • n_jobs (int, default: 4): Multi-threaded execution for parallel processing
  • max_memory_mb (int, default: 4096): Generous 4 GB memory allocation for complex operations
  • max_compute_units (float, default: 1.0): Full compute resources available (100%)
  • benchmark_runs (int, default: 5): More iterations for statistically robust benchmarks
  • spill_to_disk (bool, default: false): Disabled; all data stays in memory for maximum performance

Use Cases

  • Cloud compute instances (AWS, GCP, Azure)
  • On-premise data processing servers
  • Development workstations
  • Environments with >8GB RAM
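
To see at a glance which settings actually differ between the two templates, a small comparison like the following can help. This is only a sketch: the dictionaries are copied from the JSON shown above, and diff_templates is a hypothetical helper, not part of the pipeline.

```python
# Values copied from the edge and server templates shown above.
EDGE = {
    "random_seed": 42, "chunk_size": 64, "batch_size": 96, "n_jobs": 1,
    "max_memory_mb": 256, "max_compute_units": 0.4, "benchmark_runs": 3,
    "adaptive_chunk_resize": True, "max_chunk_retries": 3,
    "spill_to_disk": True, "output_dir": "artifacts_edge",
}
SERVER = {
    "random_seed": 42, "chunk_size": 256, "batch_size": 512, "n_jobs": 4,
    "max_memory_mb": 4096, "max_compute_units": 1.0, "benchmark_runs": 5,
    "adaptive_chunk_resize": True, "max_chunk_retries": 3,
    "spill_to_disk": False, "output_dir": "artifacts_server",
}


def diff_templates(a, b):
    """Hypothetical helper: map each shared key whose value differs to (a_value, b_value)."""
    return {key: (a[key], b[key]) for key in a if key in b and a[key] != b[key]}


differences = diff_templates(EDGE, SERVER)
for key, (edge_value, server_value) in differences.items():
    print(f"{key}: edge={edge_value!r} server={server_value!r}")
```

The keys left out of the result (random_seed, adaptive_chunk_resize, max_chunk_retries) are identical in both templates.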

Using Templates

Load Template via CLI

python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --config-template configs/pipeline.edge.template.json

Override Template Values

CLI arguments take precedence over template values:
python run_pipeline.py \
  --input data/raw_nba_data.csv \
  --config-template configs/pipeline.edge.template.json \
  --chunk-size 128 \
  --n-jobs 2
This loads the edge template but overrides chunk_size to 128 and n_jobs to 2.
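
The precedence rule can be pictured as a dictionary merge in which only explicitly passed flags replace template values. The sketch below is an illustration under assumptions: it uses argparse with None defaults to detect unset flags, and merge_template_with_cli is a hypothetical name; run_pipeline.py may implement this differently.

```python
import argparse


def merge_template_with_cli(template_values, cli_values):
    """Hypothetical helper: layer explicitly set CLI values over template values."""
    merged = dict(template_values)
    for key, value in cli_values.items():
        if value is not None:  # None means the flag was not passed
            merged[key] = value
    return merged


# Subset of the edge template shown earlier.
edge_template = {"chunk_size": 64, "batch_size": 96, "n_jobs": 1}

# Assumed flag definitions; a default of None marks "not provided".
parser = argparse.ArgumentParser()
parser.add_argument("--chunk-size", type=int, default=None)
parser.add_argument("--batch-size", type=int, default=None)
parser.add_argument("--n-jobs", type=int, default=None)
args = parser.parse_args(["--chunk-size", "128", "--n-jobs", "2"])

effective = merge_template_with_cli(edge_template, vars(args))
print(effective)  # {'chunk_size': 128, 'batch_size': 96, 'n_jobs': 2}
```

Here batch_size keeps the template value of 96 because no --batch-size flag was given.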

Load Template in Python

import json
from pathlib import Path
from pipeline.config import PipelineConfig

# Load template
template_path = Path('configs/pipeline.server.template.json')
template_values = json.loads(template_path.read_text(encoding='utf-8'))

# Convert output_dir string to Path
template_values['output_dir'] = Path(template_values['output_dir'])

# Create config from template
config = PipelineConfig(**template_values)

Creating Custom Templates

You can create your own templates for specific environments:
{
  "random_seed": 42,
  "chunk_size": 192,
  "batch_size": 384,
  "n_jobs": 2,
  "max_memory_mb": 2048,
  "max_compute_units": 0.75,
  "benchmark_runs": 5,
  "adaptive_chunk_resize": true,
  "max_chunk_retries": 3,
  "spill_to_disk": false,
  "output_dir": "artifacts_custom"
}
Save as JSON and load with --config-template path/to/your/template.json.
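
Before loading a custom template, it can be useful to check that it has the same keys and value types as the bundled templates. The following is a minimal sketch: EXPECTED_TYPES is inferred from the templates shown on this page, and check_template is a hypothetical helper rather than a pipeline API. The pipeline itself may enforce different or additional rules.

```python
import json

# Key names and types inferred from the edge and server templates (an assumption).
EXPECTED_TYPES = {
    "random_seed": int, "chunk_size": int, "batch_size": int, "n_jobs": int,
    "max_memory_mb": int, "max_compute_units": float, "benchmark_runs": int,
    "adaptive_chunk_resize": bool, "max_chunk_retries": int,
    "spill_to_disk": bool, "output_dir": str,
}


def check_template(text):
    """Hypothetical helper: return a list of problems found in a template JSON string."""
    values = json.loads(text)
    problems = [f"missing key: {k}" for k in sorted(set(EXPECTED_TYPES) - set(values))]
    problems += [f"unexpected key: {k}" for k in sorted(set(values) - set(EXPECTED_TYPES))]
    for key, expected in EXPECTED_TYPES.items():
        if key not in values:
            continue
        value = values[key]
        if expected is bool:
            ok = isinstance(value, bool)
        elif expected is float:
            # Accept whole numbers such as 1 for float fields, but not booleans.
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        else:
            # bool is a subclass of int in Python, so exclude it explicitly.
            ok = isinstance(value, expected) and not isinstance(value, bool)
        if not ok:
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


custom_template = json.dumps({
    "random_seed": 42, "chunk_size": 192, "batch_size": 384, "n_jobs": 2,
    "max_memory_mb": 2048, "max_compute_units": 0.75, "benchmark_runs": 5,
    "adaptive_chunk_resize": True, "max_chunk_retries": 3,
    "spill_to_disk": False, "output_dir": "artifacts_custom",
})
print(check_template(custom_template))  # prints []
```

An empty list means the template matches the assumed shape; any other result names the keys to fix.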

Template Selection Guide

Criteria         Edge Template           Server Template
Available RAM    <512MB                  >4GB
CPU Cores        1-2                     4+
Dataset Size     <100MB                  Any size
Priority         Resource efficiency     Maximum performance
Disk Spilling    Enabled                 Disabled
Processing Time  Slower, conservative    Faster, aggressive
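
The RAM and CPU rows of the guide can be turned into a simple check. The sketch below is one possible reading of those thresholds, not pipeline behavior: suggest_template is a hypothetical function, and it returns None when a machine fits neither column, in which case a custom template may be the better choice.

```python
EDGE_TEMPLATE = "configs/pipeline.edge.template.json"
SERVER_TEMPLATE = "configs/pipeline.server.template.json"


def suggest_template(available_ram_mb, cpu_cores):
    """Hypothetical helper: map the RAM and core thresholds from the guide to a template path."""
    if available_ram_mb < 512 and cpu_cores <= 2:
        return EDGE_TEMPLATE
    if available_ram_mb > 4096 and cpu_cores >= 4:
        return SERVER_TEMPLATE
    return None  # Neither column fits; consider a custom template.


print(suggest_template(256, 1))    # configs/pipeline.edge.template.json
print(suggest_template(8192, 8))   # configs/pipeline.server.template.json
print(suggest_template(2048, 2))   # None
```

The core count can come from os.cpu_count(); available RAM has no portable standard-library source, so it is passed in as an argument here.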

Next Steps

Configuration Overview

Learn about all configuration options

CLI Reference

See all command-line arguments
