Overview

The deploy module provides simulation capabilities for testing model performance under real-world deployment conditions, including CPU frequency scaling and different inference patterns (batch vs. streaming).

Core Functions

deployment_simulation

Simulates model deployment with CPU frequency scaling and measures performance for both batch and streaming inference patterns.
def deployment_simulation(
    model: nn.Module,
    loader: DataLoader,
    cpu_frequency_scale: float,
    stream_items: int = 128
) -> dict[str, float]
Parameters:
  • model (nn.Module, required): PyTorch model to simulate deployment for.
  • loader (DataLoader, required): DataLoader providing batches of data for the simulation.
  • cpu_frequency_scale (float, required): CPU frequency scaling factor (e.g., 0.5 for half speed, 1.0 for normal, 2.0 for double speed). Used to simulate different hardware performance levels.
  • stream_items (int, default: 128): Number of individual items to process in streaming-mode simulation.

Returns (dict[str, float]): Dictionary containing deployment metrics:
  • cpu_frequency_scale: Applied CPU frequency scaling factor
  • latency_multiplier: Computed latency multiplier (1.0 / cpu_frequency_scale)
  • batch_latency_ms: Total latency for processing one full batch in milliseconds
  • batch_throughput_sps: Batch processing throughput in samples per second
  • stream_avg_latency_ms: Average latency per item in streaming mode in milliseconds
  • stream_throughput_sps: Streaming throughput in samples per second
from edge_opt.deploy import deployment_simulation
import torch
from torch.utils.data import DataLoader

# Simulate deployment on slower hardware (50% CPU speed)
deployment_results = deployment_simulation(
    model=my_model,
    loader=test_loader,
    cpu_frequency_scale=0.5,  # 50% of normal CPU speed
    stream_items=128
)

print(f"CPU Scale: {deployment_results['cpu_frequency_scale']}")
print(f"Batch Latency: {deployment_results['batch_latency_ms']:.2f} ms")
print(f"Batch Throughput: {deployment_results['batch_throughput_sps']:.2f} samples/sec")
print(f"Stream Avg Latency: {deployment_results['stream_avg_latency_ms']:.2f} ms")
print(f"Stream Throughput: {deployment_results['stream_throughput_sps']:.2f} samples/sec")

Understanding the Results

Latency Multiplier

The latency multiplier is computed as 1.0 / cpu_frequency_scale and is applied to measured execution times:
  • cpu_frequency_scale = 0.5 → latency_multiplier = 2.0 (2× slower)
  • cpu_frequency_scale = 1.0 → latency_multiplier = 1.0 (baseline)
  • cpu_frequency_scale = 2.0 → latency_multiplier = 0.5 (2× faster)
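The multiplier is then applied to raw timing measurements. A minimal sketch of that arithmetic (the helper `scaled_latency_ms` is illustrative, not part of the module's API):

```python
def scaled_latency_ms(measured_seconds: float, cpu_frequency_scale: float) -> float:
    """Convert a measured wall-clock time into a simulated latency in milliseconds."""
    latency_multiplier = 1.0 / cpu_frequency_scale
    return measured_seconds * latency_multiplier * 1000.0

# A forward pass measured at 10 ms on the development machine:
print(scaled_latency_ms(0.010, 0.5))  # half-speed CPU: ~20 ms
print(scaled_latency_ms(0.010, 2.0))  # double-speed CPU: ~5 ms
```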

Batch vs. Streaming Inference

Batch Inference: Processes an entire batch in a single forward pass. This is typically more efficient for throughput but has higher latency per sample.

Streaming Inference: Processes items one at a time, simulating real-time inference scenarios. Lower throughput but more realistic for latency-critical applications.
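The two patterns differ only in how the input is fed to the model. A minimal sketch of both timing loops, with a placeholder model and random data standing in for a real workload:

```python
import time
import torch
import torch.nn as nn

model = nn.Linear(16, 4).eval()  # placeholder model
batch = torch.randn(32, 16)      # one batch of 32 items

with torch.no_grad():
    # Batch inference: one forward pass over the whole batch.
    t0 = time.perf_counter()
    model(batch)
    batch_latency_s = time.perf_counter() - t0

    # Streaming inference: one forward pass per item, using
    # unsqueeze(0) to build a single-item batch each time.
    t0 = time.perf_counter()
    for item in batch:
        model(item.unsqueeze(0))
    stream_latency_s = time.perf_counter() - t0

print(f"batch:  {batch_latency_s * 1000:.3f} ms total")
print(f"stream: {stream_latency_s / len(batch) * 1000:.3f} ms per item")
```

The per-call overhead of the streaming loop is why its throughput is typically lower than the batch figure, even for the same model and data.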

Use Cases

# Test if model meets latency requirements on target hardware
min_cpu_scale = 0.4  # Low-end device at 40% of development machine
max_latency_ms = 50.0

results = deployment_simulation(
    model=model,
    loader=test_loader,
    cpu_frequency_scale=min_cpu_scale
)

if results['stream_avg_latency_ms'] <= max_latency_ms:
    print("✓ Model meets latency requirements")
else:
    print(f"✗ Model latency {results['stream_avg_latency_ms']:.2f} ms exceeds {max_latency_ms} ms")
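To compare several target devices at once, the same check can be swept over a range of scaling factors. The sketch below stubs out deployment_simulation with a fake that only mimics the shape of the returned dictionary (the 10 ms baseline latency is invented), so the loop structure is the point rather than the numbers:

```python
# Stand-in for edge_opt.deploy.deployment_simulation so this sketch runs
# without a model or DataLoader; it fakes a 10 ms unscaled per-item latency.
def deployment_simulation(model, loader, cpu_frequency_scale, stream_items=128):
    latency_multiplier = 1.0 / cpu_frequency_scale
    return {
        "cpu_frequency_scale": cpu_frequency_scale,
        "latency_multiplier": latency_multiplier,
        "stream_avg_latency_ms": 10.0 * latency_multiplier,
    }

max_latency_ms = 50.0
for scale in (1.0, 0.75, 0.5, 0.25):
    results = deployment_simulation(None, None, cpu_frequency_scale=scale)
    ok = "meets" if results["stream_avg_latency_ms"] <= max_latency_ms else "exceeds"
    print(f"scale={scale:.2f}: {results['stream_avg_latency_ms']:.2f} ms/item "
          f"({ok} the {max_latency_ms} ms budget)")
```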
The simulation uses torch.no_grad() to disable gradient computation and sets the model to evaluation mode. All timing measurements use time.perf_counter() for high-resolution timing.

Performance Considerations

  • The function measures actual execution time and applies the frequency scaling multiplier
  • Warmup is not explicitly performed; consider running the function multiple times for more stable measurements
  • Streaming simulation processes items sequentially with unsqueeze(0) to simulate single-item batches
  • The batch used for simulation is taken from the first batch in the provided DataLoader
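Since no warmup is built in, one way to stabilize the numbers is to discard the first run and aggregate the rest. A sketch of that pattern (`run_with_warmup` and the toy workload are illustrative; `measure` could instead wrap a deployment_simulation call and return one of its metrics):

```python
import statistics
import time

def run_with_warmup(measure, runs: int = 5, warmup: int = 1) -> float:
    """Call `measure` repeatedly, discard warmup runs, return the median latency."""
    for _ in range(warmup):
        measure()  # warmup runs; results discarded
    samples = [measure() for _ in range(runs)]
    return statistics.median(samples)

# Placeholder workload standing in for a timed model forward pass.
def measure() -> float:
    t0 = time.perf_counter()
    sum(i * i for i in range(10_000))
    return (time.perf_counter() - t0) * 1000.0  # ms

print(f"median latency: {run_with_warmup(measure):.3f} ms")
```

The median is used rather than the mean so that a single slow outlier (e.g. from OS scheduling jitter) does not skew the result.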
