Overview

The deploy module provides simulation capabilities for testing model performance under real-world deployment conditions, including CPU frequency scaling and different inference patterns (batch vs. streaming).

Core Functions

deployment_simulation

Simulates model deployment with CPU frequency scaling and measures performance for both batch and streaming inference patterns.
def deployment_simulation(
    model: nn.Module,
    loader: DataLoader,
    cpu_frequency_scale: float,
    stream_items: int = 128
) -> dict[str, float]
Parameters:
  • model (nn.Module, required): PyTorch model to simulate deployment for.
  • loader (DataLoader, required): DataLoader providing batches of data for the simulation.
  • cpu_frequency_scale (float, required): CPU frequency scaling factor (e.g., 0.5 for half speed, 1.0 for normal, 2.0 for double speed). Used to simulate different hardware performance levels.
  • stream_items (int, default: 128): Number of individual items to process in streaming-mode simulation.

Returns (dict[str, float]): Dictionary containing deployment metrics:
  • cpu_frequency_scale: Applied CPU frequency scaling factor
  • latency_multiplier: Computed latency multiplier (1.0 / cpu_frequency_scale)
  • batch_latency_ms: Total latency for processing one full batch in milliseconds
  • batch_throughput_sps: Batch processing throughput in samples per second
  • stream_avg_latency_ms: Average latency per item in streaming mode in milliseconds
  • stream_throughput_sps: Streaming throughput in samples per second
from edge_opt.deploy import deployment_simulation
import torch
from torch.utils.data import DataLoader

# Simulate deployment on slower hardware (50% CPU speed)
deployment_results = deployment_simulation(
    model=my_model,
    loader=test_loader,
    cpu_frequency_scale=0.5,  # 50% of normal CPU speed
    stream_items=128
)

print(f"CPU Scale: {deployment_results['cpu_frequency_scale']}")
print(f"Batch Latency: {deployment_results['batch_latency_ms']:.2f} ms")
print(f"Batch Throughput: {deployment_results['batch_throughput_sps']:.2f} samples/sec")
print(f"Stream Avg Latency: {deployment_results['stream_avg_latency_ms']:.2f} ms")
print(f"Stream Throughput: {deployment_results['stream_throughput_sps']:.2f} samples/sec")

Understanding the Results

Latency Multiplier

The latency multiplier is computed as 1.0 / cpu_frequency_scale and is applied to measured execution times:
  • cpu_frequency_scale = 0.5 → latency_multiplier = 2.0 (2× slower)
  • cpu_frequency_scale = 1.0 → latency_multiplier = 1.0 (baseline)
  • cpu_frequency_scale = 2.0 → latency_multiplier = 0.5 (2× faster)
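The multiplier is then applied to raw timing measurements. A minimal sketch of that arithmetic (the helper `scaled_latency_ms` is illustrative, not part of the module's API):

```python
def scaled_latency_ms(measured_seconds: float, cpu_frequency_scale: float) -> float:
    """Convert a measured wall-clock time into a simulated latency in milliseconds."""
    latency_multiplier = 1.0 / cpu_frequency_scale
    return measured_seconds * latency_multiplier * 1000.0

# A forward pass measured at 10 ms on the development machine:
print(scaled_latency_ms(0.010, 0.5))  # half-speed CPU: ~20 ms
print(scaled_latency_ms(0.010, 2.0))  # double-speed CPU: ~5 ms
```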

Batch vs. Streaming Inference

Batch Inference: Processes an entire batch in a single forward pass. This is typically more efficient for throughput but has higher latency per sample.

Streaming Inference: Processes items one at a time, simulating real-time inference scenarios. Lower throughput but more realistic for latency-critical applications.
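The two patterns differ only in how the input is fed to the model. A minimal sketch of both timing loops, with a placeholder model and random data standing in for a real workload:

```python
import time
import torch
import torch.nn as nn

model = nn.Linear(16, 4).eval()  # placeholder model
batch = torch.randn(32, 16)      # one batch of 32 items

with torch.no_grad():
    # Batch inference: one forward pass over the whole batch.
    t0 = time.perf_counter()
    model(batch)
    batch_latency_s = time.perf_counter() - t0

    # Streaming inference: one forward pass per item, using
    # unsqueeze(0) to build a single-item batch each time.
    t0 = time.perf_counter()
    for item in batch:
        model(item.unsqueeze(0))
    stream_latency_s = time.perf_counter() - t0

print(f"batch:  {batch_latency_s * 1000:.3f} ms total")
print(f"stream: {stream_latency_s / len(batch) * 1000:.3f} ms per item")
```

The per-call overhead of the streaming loop is why its throughput is typically lower than the batch figure, even for the same model and data.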

Use Cases

# Test if model meets latency requirements on target hardware
min_cpu_scale = 0.4  # Low-end device at 40% of development machine
max_latency_ms = 50.0

results = deployment_simulation(
    model=model,
    loader=test_loader,
    cpu_frequency_scale=min_cpu_scale
)

if results['stream_avg_latency_ms'] <= max_latency_ms:
    print("✓ Model meets latency requirements")
else:
    print(f"✗ Model latency {results['stream_avg_latency_ms']:.2f} ms exceeds {max_latency_ms} ms")
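To compare several target devices at once, the same check can be swept over a range of scaling factors. The sketch below stubs out deployment_simulation with a fake that only mimics the shape of the returned dictionary (the 10 ms baseline latency is invented), so the loop structure is the point rather than the numbers:

```python
# Stand-in for edge_opt.deploy.deployment_simulation so this sketch runs
# without a model or DataLoader; it fakes a 10 ms unscaled per-item latency.
def deployment_simulation(model, loader, cpu_frequency_scale, stream_items=128):
    latency_multiplier = 1.0 / cpu_frequency_scale
    return {
        "cpu_frequency_scale": cpu_frequency_scale,
        "latency_multiplier": latency_multiplier,
        "stream_avg_latency_ms": 10.0 * latency_multiplier,
    }

max_latency_ms = 50.0
for scale in (1.0, 0.75, 0.5, 0.25):
    results = deployment_simulation(None, None, cpu_frequency_scale=scale)
    ok = "meets" if results["stream_avg_latency_ms"] <= max_latency_ms else "exceeds"
    print(f"scale={scale:.2f}: {results['stream_avg_latency_ms']:.2f} ms/item "
          f"({ok} the {max_latency_ms} ms budget)")
```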
The simulation uses torch.no_grad() to disable gradient computation and sets the model to evaluation mode. All timing measurements use time.perf_counter() for high-resolution timing.

Performance Considerations

  • The function measures actual execution time and applies the frequency scaling multiplier
  • Warmup is not explicitly performed; consider running the function multiple times for more stable measurements
  • Streaming simulation processes items sequentially with unsqueeze(0) to simulate single-item batches
  • The batch used for simulation is taken from the first batch in the provided DataLoader
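Since no warmup is built in, one way to stabilize the numbers is to discard the first run and aggregate the rest. A sketch of that pattern (`run_with_warmup` and the toy workload are illustrative; `measure` could instead wrap a deployment_simulation call and return one of its metrics):

```python
import statistics
import time

def run_with_warmup(measure, runs: int = 5, warmup: int = 1) -> float:
    """Call `measure` repeatedly, discard warmup runs, return the median latency."""
    for _ in range(warmup):
        measure()  # warmup runs; results discarded
    samples = [measure() for _ in range(runs)]
    return statistics.median(samples)

# Placeholder workload standing in for a timed model forward pass.
def measure() -> float:
    t0 = time.perf_counter()
    sum(i * i for i in range(10_000))
    return (time.perf_counter() - t0) * 1000.0  # ms

print(f"median latency: {run_with_warmup(measure):.3f} ms")
```

The median is used rather than the mean so that a single slow outlier (e.g. from OS scheduling jitter) does not skew the result.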
