
Overview

The server deployment configuration is optimized for high-throughput batch processing on machines with abundant resources. This mode prioritizes speed and model quality over memory conservation.

Resource Configuration

Default Constraints

Server deployments leverage available compute and memory:
  • Memory limit: 4096 MB (4 GB)
  • Compute units: 1.0 (100% utilization)
  • Chunk size: 256 rows per chunk
  • Batch size: 512 rows
  • Parallel jobs: 4 (concurrent processing)

Performance Characteristics

  • Higher throughput: 500-2000+ rows/second depending on hardware
  • Lower latency: Larger chunks reduce overhead
  • Better model quality: Full compute capacity enables more sophisticated features
  • Parallel benchmarking: Multiple constraint experiments run concurrently

Configuration

Server Configuration Template

Location: configs/pipeline.server.template.json
{
  "random_seed": 42,
  "chunk_size": 256,
  "batch_size": 512,
  "n_jobs": 4,
  "max_memory_mb": 4096,
  "max_compute_units": 1.0,
  "benchmark_runs": 5,
  "adaptive_chunk_resize": true,
  "max_chunk_retries": 3,
  "spill_to_disk": false,
  "output_dir": "artifacts_server"
}
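A loader for this template might look like the following sketch. The `PipelineConfig` dataclass and `load_config` helper are illustrative, not the pipeline's actual API; the field names and defaults mirror the template above:

```python
import json
from dataclasses import dataclass, fields

@dataclass
class PipelineConfig:
    random_seed: int = 42
    chunk_size: int = 256
    batch_size: int = 512
    n_jobs: int = 4
    max_memory_mb: int = 4096
    max_compute_units: float = 1.0
    benchmark_runs: int = 5
    adaptive_chunk_resize: bool = True
    max_chunk_retries: int = 3
    spill_to_disk: bool = False
    output_dir: str = "artifacts_server"

def load_config(path: str) -> PipelineConfig:
    """Load a JSON template, ignoring unknown keys so older templates keep working."""
    with open(path) as f:
        raw = json.load(f)
    known = {f.name for f in fields(PipelineConfig)}
    return PipelineConfig(**{k: v for k, v in raw.items() if k in known})
```

Dropping unknown keys (rather than erroring) is one reasonable policy for templates that evolve across versions; a stricter loader could raise instead.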

Configuration Parameters

| Parameter | Value | Purpose |
| --- | --- | --- |
| chunk_size | 256 | Larger chunks for better throughput |
| batch_size | 512 | Larger batches reduce model update overhead |
| n_jobs | 4 | Parallel execution for constraint experiments |
| max_memory_mb | 4096 | Generous memory allocation |
| max_compute_units | 1.0 | Full CPU utilization |
| benchmark_runs | 5 | More runs for statistical significance |
| spill_to_disk | false | Memory is sufficient; avoids I/O overhead |

Running on Servers

Basic Deployment

cd "NBA Data Preprocessing/task"
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --config-template ../../configs/pipeline.server.template.json

High-Memory Configuration

For servers with 16+ GB RAM:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --config-template ../../configs/pipeline.server.template.json \
  --max-memory-mb 8192 \
  --chunk-size 512 \
  --batch-size 1024

Maximum Parallelism

Utilize all CPU cores:
python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --config-template ../../configs/pipeline.server.template.json \
  --n-jobs -1
Setting --n-jobs -1 uses all available CPU cores. This significantly accelerates constraint experiments but may increase timing variance.
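CLI flags like these take precedence over values from the template. A sketch of that merge logic (the `merge_overrides` helper and the exact flag list are assumptions, not the pipeline's real argument parser):

```python
import argparse

def merge_overrides(template: dict, args: argparse.Namespace) -> dict:
    """Apply CLI flags on top of template values; flags left unset keep the template's value."""
    merged = dict(template)
    for key in ("max_memory_mb", "chunk_size", "batch_size", "n_jobs"):
        value = getattr(args, key, None)
        if value is not None:
            merged[key] = value
    return merged

parser = argparse.ArgumentParser()
parser.add_argument("--max-memory-mb", type=int, dest="max_memory_mb")
parser.add_argument("--chunk-size", type=int, dest="chunk_size")
parser.add_argument("--batch-size", type=int, dest="batch_size")
parser.add_argument("--n-jobs", type=int, dest="n_jobs")
```

Because each flag defaults to `None`, only flags the operator actually passed override the template.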

Performance Optimization

Scaling Parameters

The pipeline automatically adjusts batch and chunk sizes based on resource constraints:
# From engine.py:55-60
memory_factor = max(0.1, min(1.0, memory_cap / 1024))
compute_factor = max(0.1, min(1.0, compute_cap))
scale = memory_factor * compute_factor
adjusted_batch = max(16, int(batch_base * scale))
adjusted_chunk = max(16, int(chunk_base * scale))
With server settings (4096 MB, 1.0 compute):
  • Memory factor: 1.0 (4096 / 1024)
  • Compute factor: 1.0
  • No downscaling applied - full performance
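The same formula can be run by hand to see when downscaling kicks in. This standalone copy of the engine.py logic shows the server profile passing through unchanged, while a hypothetical constrained profile (512 MB, 0.5 compute units) is scaled down:

```python
def scale_sizes(memory_cap_mb, compute_cap, batch_base=512, chunk_base=256):
    # Same clamping logic as the engine.py snippet above.
    memory_factor = max(0.1, min(1.0, memory_cap_mb / 1024))
    compute_factor = max(0.1, min(1.0, compute_cap))
    scale = memory_factor * compute_factor
    adjusted_batch = max(16, int(batch_base * scale))
    adjusted_chunk = max(16, int(chunk_base * scale))
    return adjusted_batch, adjusted_chunk

# Server profile: factors clamp to 1.0, so sizes are unchanged.
# scale_sizes(4096, 1.0) -> (512, 256)
# Constrained profile: scale = 0.5 * 0.5 = 0.25.
# scale_sizes(512, 0.5) -> (128, 64)
```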

Parallel Benchmark Execution

Constraint experiments run in parallel when n_jobs > 1:
# From engine.py:397-402
if self.config.n_jobs > 1:
    experiment_rows = Parallel(n_jobs=self.config.n_jobs)(
        delayed(self._single_constraint_run)(source, c, m, cp) for c, m, cp in tasks
    )
else:
    experiment_rows = [self._single_constraint_run(source, c, m, cp) for c, m, cp in tasks]
With n_jobs: 4, a 12-configuration constraint sweep completes in roughly a quarter of the time of sequential execution.
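The same fan-out can be sketched with the standard library alone (the pipeline itself uses joblib's `Parallel` as shown above; `single_constraint_run` below is a stand-in for `_single_constraint_run`):

```python
from concurrent.futures import ThreadPoolExecutor

def single_constraint_run(task):
    # Stand-in for the real per-configuration benchmark run.
    chunk_size, memory_mb, compute = task
    return {"chunk_size": chunk_size, "memory_mb": memory_mb, "compute": compute}

def run_sweep(tasks, n_jobs=4):
    """Run constraint tasks in parallel when n_jobs > 1, sequentially otherwise."""
    if n_jobs > 1:
        with ThreadPoolExecutor(max_workers=n_jobs) as pool:
            return list(pool.map(single_constraint_run, tasks))
    return [single_constraint_run(t) for t in tasks]
```

Like `Parallel`, `pool.map` preserves input order, so result rows line up with the task list regardless of completion order.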

Deployment Architecture

Minimum:
  • 4 CPU cores
  • 8 GB RAM
  • 10 GB free disk space
  • SSD for faster CSV ingestion
Recommended:
  • 8+ CPU cores
  • 16+ GB RAM
  • 50 GB free disk space
  • NVMe SSD for optimal I/O

Docker Deployment

Example Dockerfile:
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENV MAX_MEMORY_MB=4096
ENV N_JOBS=4

CMD ["python", "NBA Data Preprocessing/task/run_pipeline.py", \
     "--input", "data/nba2k-full.csv", \
     "--config-template", "configs/pipeline.server.template.json"]
Run container:
docker build -t nba-pipeline:server .
docker run --cpus=4 --memory=8g \
  -v $(pwd)/artifacts:/app/artifacts_server \
  nba-pipeline:server

Cloud Deployment

AWS EC2

Recommended instance: m5.xlarge (4 vCPUs, 16 GB RAM)
# Launch instance
aws ec2 run-instances \
  --image-id ami-xxxxxxxxx \
  --instance-type m5.xlarge \
  --key-name your-key \
  --security-groups your-sg

# SSH and run
ssh -i your-key.pem ec2-user@instance-ip
cd nba-pipeline
python "NBA Data Preprocessing/task/run_pipeline.py" \
  --config-template configs/pipeline.server.template.json

GCP Compute Engine

Recommended machine: n2-standard-4 (4 vCPUs, 16 GB RAM)
gcloud compute instances create nba-pipeline-server \
  --machine-type=n2-standard-4 \
  --zone=us-central1-a \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud

Energy and Telemetry

RAPL Energy Monitoring

On Intel/AMD servers, the pipeline uses RAPL (Running Average Power Limit) for accurate energy measurements:
# From engine.py:298-299
telemetry = self.hardware.compare(start_snapshot, end_snapshot)
telemetry['fallback_energy_estimate_j'] = total_elapsed * 30.0
Typical energy consumption:
  • Batch mode: ~45J per run
  • Streaming mode: ~30J per run
RAPL may be unavailable in:
  • Docker containers (unless privileged)
  • VMs without MSR access
  • ARM-based servers
Fallback estimation is used automatically.
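A sketch of how RAPL counters can be read from the Linux powercap sysfs interface, falling back to a wall-clock estimate when counters are missing or unreadable (the 30 W constant mirrors the fallback in the engine.py snippet above; the helper names are illustrative):

```python
import glob
import time

RAPL_GLOB = "/sys/class/powercap/intel-rapl:*/energy_uj"

def read_rapl_uj():
    """Sum RAPL package counters in microjoules, or None if unavailable."""
    files = glob.glob(RAPL_GLOB)
    if not files:
        return None
    try:
        return sum(int(open(f).read()) for f in files)
    except OSError:  # e.g. permission denied inside an unprivileged container
        return None

def measure_energy_j(work, fallback_watts=30.0):
    """Run `work()` and return joules from RAPL, or elapsed_time * fallback_watts."""
    start_e = read_rapl_uj()
    start_t = time.perf_counter()
    work()
    elapsed = time.perf_counter() - start_t
    end_e = read_rapl_uj()
    if start_e is not None and end_e is not None:
        return (end_e - start_e) / 1e6  # microjoules -> joules
    return elapsed * fallback_watts
```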

Hardware Telemetry

Each run captures:
  • CPU utilization snapshots
  • Memory usage before/after
  • Energy consumption (RAPL or estimate)
  • Operator-level profiling
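A minimal snapshot/compare pair using only the standard library gives a feel for what such telemetry looks like (this is a sketch, not the pipeline's hardware module, and it records far less than the real implementation):

```python
import resource
import time

def snapshot():
    """Capture a minimal telemetry snapshot for the current process (Unix only)."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "timestamp": time.time(),
        "max_rss_kb": usage.ru_maxrss,   # peak resident set size (KB on Linux)
        "user_cpu_s": usage.ru_utime,
        "system_cpu_s": usage.ru_stime,
    }

def compare(start, end):
    """Per-field deltas between two snapshots, as recorded for each run."""
    return {k: end[k] - start[k] for k in start}
```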

Deployment Checklist

1. Verify System Resources

# Check available memory
free -m

# Check CPU cores
nproc

# Check disk space
df -h
2. Install Dependencies

cd "NBA Data Preprocessing"
pip install -r requirements.txt
3. Run Validation Tests

cd "NBA Data Preprocessing/task"
python -m unittest discover -s test -p 'test_*.py'
4. Run Baseline Benchmark

python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --config-template ../../configs/pipeline.server.template.json \
  --benchmark-runs 3
5. Analyze Performance

Review artifacts:
  • reports/pipeline_report.json - Overall metrics
  • benchmarks/constraint_experiment.csv - Performance sweep
  • benchmarks/*.png - Visualization plots
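A small helper can pull headline numbers out of pipeline_report.json for quick comparison across runs. The key names below are assumptions; inspect your own report for the exact schema:

```python
import json

def summarize_report(path):
    """Extract headline metrics from a pipeline report JSON file.

    Key names (rows_per_second, peak_memory_mb, energy_j) are illustrative;
    missing keys come back as None rather than raising.
    """
    with open(path) as f:
        report = json.load(f)
    return {
        "rows_per_second": report.get("rows_per_second"),
        "peak_memory_mb": report.get("peak_memory_mb"),
        "energy_j": report.get("energy_j"),
    }
```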
6. Tune for Production

Based on baseline:
  • Increase chunk_size if memory allows
  • Increase n_jobs for faster sweeps
  • Adjust benchmark_runs for desired confidence
7. Configure Persistent Storage

python run_pipeline.py \
  --input ../data/nba2k-full.csv \
  --config-template ../../configs/pipeline.server.template.json \
  --output-dir /mnt/nfs/nba_artifacts

Troubleshooting

Low Throughput

Symptom: Throughput below 500 rows/second on capable hardware.
Solutions:
  • Increase chunk_size to 512 or higher
  • Verify SSD storage (not HDD) for CSV ingestion
  • Check for background processes consuming CPU/memory
  • Disable spill_to_disk if accidentally enabled

Memory Pressure

Symptom: Unexpected memory exceeded warnings.
Solutions:
  • Verify actual available memory: free -m
  • Check for memory leaks in other processes
  • Increase max_memory_mb to 8192 or higher
  • Review chunk metrics in reports/streaming_chunks.jsonl
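To act on the last suggestion, the per-chunk JSONL metrics can be scanned for the worst offender. The `memory_mb` field name here is an assumption; check one line of your streaming_chunks.jsonl for the actual field names first:

```python
import json

def peak_chunk_memory(jsonl_path, field="memory_mb"):
    """Scan per-chunk metrics and return (peak_value, offending_record).

    The field name defaults to a hypothetical `memory_mb`; pass the real
    one from your streaming_chunks.jsonl.
    """
    peak, peak_record = 0.0, None
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get(field, 0) > peak:
                peak, peak_record = record[field], record
    return peak, peak_record
```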

Timing Variance in Benchmarks

Symptom: High standard deviation in latency measurements.
Solutions:
  • Increase benchmark_runs to 10+ for more stable estimates
  • Reduce n_jobs to 1 for strict timing stability
  • Pin CPU affinity to avoid scheduler interference
  • Run during off-peak hours to reduce contention
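CPU pinning can be done from Python on Linux via `os.sched_setaffinity`; the small wrapper below (illustrative, not part of the pipeline) degrades gracefully on platforms that lack it:

```python
import os

def pin_to_cores(cores):
    """Pin this process to a fixed set of cores to reduce scheduler jitter.

    Returns True on success, False on platforms without sched_setaffinity
    (e.g. macOS and Windows).
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, set(cores))  # pid 0 = current process
    return True
```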

RAPL Unavailable

Symptom: Energy measurements show fallback estimates.
Solutions:
  • Run with elevated privileges: sudo python run_pipeline.py ...
  • In Docker, use --privileged flag
  • In VMs, enable MSR access in hypervisor settings
  • Accept fallback estimates (still useful for relative comparisons)

Best Practices

Set spill_to_disk: false to avoid unnecessary I/O overhead. Server memory should be sufficient to hold all intermediate results.
Set n_jobs: 4 or higher to accelerate constraint experiments. This reduces total benchmark time by 3-4x.
Configure output_dir to point to NFS/S3 for long-term retention:
--output-dir /mnt/nfs/experiments/$(date +%Y%m%d_%H%M%S)
For production deployments, integrate with monitoring:
  • Prometheus for metrics collection
  • Grafana for visualization
  • Parse pipeline_report.json for custom dashboards
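One lightweight integration path is rendering report metrics in the Prometheus exposition format, which node_exporter's textfile collector can scrape. A sketch (the metric prefix and the assumption that interesting report values are numeric are mine):

```python
def to_prometheus(report: dict, prefix: str = "nba_pipeline") -> str:
    """Render numeric report fields as Prometheus gauges in exposition format.

    Non-numeric fields are skipped; write the result to a .prom file in
    node_exporter's textfile collector directory.
    """
    lines = []
    for key, value in report.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            lines.append(f"# TYPE {prefix}_{key} gauge")
            lines.append(f"{prefix}_{key} {value}")
    return "\n".join(lines) + "\n"
```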
Validate deterministic behavior:
for i in {1..3}; do
  python run_pipeline.py --input ../data/nba2k-full.csv --random-seed 42
done
# Compare dataset_fingerprint in reports
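The fingerprint comparison can be automated. The `dataset_fingerprint` function below shows one plausible way such a fingerprint could be computed (not necessarily the pipeline's own scheme); `runs_are_deterministic` simply compares the field across report files:

```python
import hashlib
import json

def dataset_fingerprint(rows) -> str:
    """Order-insensitive SHA-256 fingerprint over JSON-serializable rows.

    One plausible construction: hash each row canonically, sort the digests,
    then hash the concatenation so row order does not matter.
    """
    digests = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in rows
    )
    return hashlib.sha256("".join(digests).encode()).hexdigest()

def runs_are_deterministic(report_paths) -> bool:
    """True when every report carries the same dataset_fingerprint value."""
    fingerprints = set()
    for path in report_paths:
        with open(path) as f:
            fingerprints.add(json.load(f).get("dataset_fingerprint"))
    return len(fingerprints) == 1
```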
