Overview
The deploy module provides simulation capabilities for testing model performance under real-world deployment conditions, including CPU frequency scaling and different inference patterns (batch vs. streaming).

Core Functions
deployment_simulation
Simulates model deployment with CPU frequency scaling and measures performance for both batch and streaming inference patterns.

Parameters:
- PyTorch model to simulate deployment for
- DataLoader providing batches of data for simulation
- CPU frequency scaling factor (e.g., 0.5 for half speed, 1.0 for normal, 2.0 for double speed), used to simulate different hardware performance levels
- Number of individual items to process in streaming-mode simulation
Returns a dictionary containing deployment metrics:
- cpu_frequency_scale: Applied CPU frequency scaling factor
- latency_multiplier: Computed latency multiplier (1.0 / cpu_frequency_scale)
- batch_latency_ms: Total latency for processing one full batch, in milliseconds
- batch_throughput_sps: Batch processing throughput, in samples per second
- stream_avg_latency_ms: Average latency per item in streaming mode, in milliseconds
- stream_throughput_sps: Streaming throughput, in samples per second
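The documentation above does not pin down the function's exact signature, so the following is a minimal pure-Python sketch of how such a simulation could be structured. The name `deployment_simulation_sketch`, the callable `model`, and the list-based `dataloader` are illustrative assumptions, not the real API.

```python
import time

def deployment_simulation_sketch(model, dataloader, cpu_frequency_scale=1.0, num_stream_items=8):
    """Illustrative re-implementation: 'model' is any callable, 'dataloader' yields batches (lists)."""
    latency_multiplier = 1.0 / cpu_frequency_scale
    batch = next(iter(dataloader))  # the simulation uses only the first batch

    # Batch inference: one call over the whole batch.
    start = time.perf_counter()
    model(batch)
    batch_time = (time.perf_counter() - start) * latency_multiplier
    batch_latency_ms = batch_time * 1000.0
    batch_throughput_sps = len(batch) / batch_time if batch_time > 0 else float("inf")

    # Streaming inference: items processed one at a time as single-item "batches".
    items = batch[:num_stream_items]
    start = time.perf_counter()
    for item in items:
        model([item])
    stream_time = (time.perf_counter() - start) * latency_multiplier
    stream_avg_latency_ms = stream_time * 1000.0 / len(items)
    stream_throughput_sps = len(items) / stream_time if stream_time > 0 else float("inf")

    return {
        "cpu_frequency_scale": cpu_frequency_scale,
        "latency_multiplier": latency_multiplier,
        "batch_latency_ms": batch_latency_ms,
        "batch_throughput_sps": batch_throughput_sps,
        "stream_avg_latency_ms": stream_avg_latency_ms,
        "stream_throughput_sps": stream_throughput_sps,
    }

# Toy usage: a "model" that sums each input, over a dataloader of two batches.
metrics = deployment_simulation_sketch(
    model=lambda batch: [sum(x) for x in batch],
    dataloader=[[[1.0, 2.0]] * 16, [[3.0, 4.0]] * 16],
    cpu_frequency_scale=0.5,
    num_stream_items=4,
)
print(metrics["latency_multiplier"])  # 2.0
```

With a real PyTorch model, the batch call would run under `torch.no_grad()` and the streaming items would be wrapped with `unsqueeze(0)`, as described later in this document.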
Understanding the Results
Latency Multiplier
The latency multiplier is computed as 1.0 / cpu_frequency_scale and is applied to measured execution times:
- cpu_frequency_scale = 0.5 → latency_multiplier = 2.0 (2× slower)
- cpu_frequency_scale = 1.0 → latency_multiplier = 1.0 (baseline)
- cpu_frequency_scale = 2.0 → latency_multiplier = 0.5 (2× faster)
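The mapping above can be checked with a few lines of plain Python; `latency_multiplier` here is a standalone helper for illustration, not the library's own function:

```python
def latency_multiplier(cpu_frequency_scale):
    # Slower simulated CPUs (scale < 1.0) inflate measured times; faster ones deflate them.
    return 1.0 / cpu_frequency_scale

for scale, expected in [(0.5, 2.0), (1.0, 1.0), (2.0, 0.5)]:
    assert latency_multiplier(scale) == expected

# Applying it to a measured wall-clock time:
measured_ms = 10.0                                    # measured on the host
simulated_ms = measured_ms * latency_multiplier(0.5)  # simulate half CPU speed
print(simulated_ms)  # 20.0
```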
Batch vs. Streaming Inference
Batch Inference: Processes an entire batch in a single forward pass. This is typically more efficient for throughput but has higher latency per sample.

Streaming Inference: Processes items one at a time, simulating real-time inference scenarios. Lower throughput, but more realistic for latency-critical applications.
Use Cases
The simulation uses torch.no_grad() to disable gradient computation and sets the model to evaluation mode. All timing measurements use time.perf_counter() for high-resolution timing.

Performance Considerations
- The function measures actual execution time and applies the frequency scaling multiplier
- Warmup is not explicitly performed; consider running the function multiple times for more stable measurements
- Streaming simulation processes items sequentially with unsqueeze(0) to simulate single-item batches
- The batch used for simulation is taken from the first batch in the provided DataLoader
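Since no warmup is performed, one way to stabilize measurements (as suggested above) is to discard a few warmup runs and take the median of the rest. The helpers `stable_measure` and `run_once` below are illustrative sketches, not part of the module:

```python
import statistics

def stable_measure(run_once, warmup=2, repeats=5):
    """Discard warmup runs (cache/allocator effects), then report the median of the rest."""
    for _ in range(warmup):
        run_once()  # results intentionally thrown away
    return statistics.median(run_once() for _ in range(repeats))

# Toy stand-in for one simulation run returning a latency in ms;
# the first two samples model warmup noise.
samples = iter([9.0, 8.0, 5.0, 5.2, 4.9, 5.1, 5.0])
latency_ms = stable_measure(lambda: next(samples), warmup=2, repeats=5)
print(latency_ms)  # 5.0
```

In practice, `run_once` would call the deployment simulation and return the metric of interest (e.g., batch_latency_ms).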