Measure and analyze model performance metrics for edge deployment optimization
Benchmarking is critical for evaluating edge AI models under realistic hardware constraints. The Edge AI Hardware Optimization framework provides comprehensive tools to measure accuracy, latency, memory usage, throughput, and energy consumption.
All metrics are returned in a structured dataclass defined in src/edge_opt/metrics.py:11-19:
```python
from dataclasses import dataclass

@dataclass
class PerfMetrics:
    accuracy: float        # Classification accuracy (0.0 to 1.0)
    latency_ms: float      # Mean latency in milliseconds
    latency_std_ms: float  # Standard deviation of latency
    latency_p95_ms: float  # 95th percentile latency
    throughput_sps: float  # Throughput in samples per second
    memory_mb: float       # Model memory footprint in MB
    energy_proxy_j: float  # Energy proxy in joules
```
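As a usage sketch (the field values below are made up for illustration), a `PerfMetrics` instance can be built directly and converted to a plain dict with `dataclasses.asdict` for logging or JSON export:

```python
from dataclasses import dataclass, asdict

# Mirror of the PerfMetrics dataclass above, repeated so the example
# is self-contained
@dataclass
class PerfMetrics:
    accuracy: float
    latency_ms: float
    latency_std_ms: float
    latency_p95_ms: float
    throughput_sps: float
    memory_mb: float
    energy_proxy_j: float

m = PerfMetrics(accuracy=0.91, latency_ms=12.4, latency_std_ms=0.8,
                latency_p95_ms=13.9, throughput_sps=80.6,
                memory_mb=4.7, energy_proxy_j=0.062)
record = asdict(m)  # plain dict, ready for JSON serialization
```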
Measure classification accuracy on a validation dataset.
```python
def evaluate_accuracy(
    model: nn.Module,
    loader: DataLoader,
    device: torch.device,
    precision: str = "fp32",
) -> float:
    """Evaluate model accuracy on a dataset.

    Args:
        model: PyTorch model to evaluate
        loader: DataLoader with validation data
        device: Device to run evaluation on (cpu/cuda)
        precision: Precision mode ('fp32', 'fp16', or 'int8')

    Returns:
        Accuracy as a float between 0.0 and 1.0
    """
```
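The function body is not shown above; the following is a minimal sketch of what such an evaluation loop typically looks like (the `fp16` handling and the toy model are assumptions for illustration, not the framework's actual code):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate_accuracy(model, loader, device, precision="fp32"):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            if precision == "fp16":
                inputs = inputs.half()  # assumed handling, for illustration
            preds = model(inputs).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Smoke test with a toy classifier and random data
model = nn.Linear(4, 2)
x = torch.randn(32, 4)
y = torch.randint(0, 2, (32,))
loader = DataLoader(TensorDataset(x, y), batch_size=8)
acc = evaluate_accuracy(model, loader, torch.device("cpu"))
```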
```python
def measure_latency(
    model: nn.Module,
    sample_input: torch.Tensor,
    num_runs: int = 100,
    warmup: int = 10,
) -> float:
    """Measure model inference latency.

    Args:
        model: PyTorch model to benchmark
        sample_input: Sample input tensor
        num_runs: Number of inference runs to average
        warmup: Number of warmup runs before measurement

    Returns:
        Average latency in milliseconds
    """
```
Measure latency statistics across multiple benchmark repeats.
```python
def measure_latency_distribution(
    model: nn.Module,
    sample_input: torch.Tensor,
    repeats: int = 5,
    num_runs: int = 100,
    warmup: int = 10,
) -> tuple[float, float, float]:
    """Measure latency distribution with statistics.

    Args:
        model: PyTorch model to benchmark
        sample_input: Sample input tensor
        repeats: Number of times to repeat measurement
        num_runs: Number of runs per repeat
        warmup: Number of warmup runs

    Returns:
        Tuple of (mean_ms, std_ms, p95_ms)
    """
```
The latency measurement function in src/edge_opt/metrics.py:39-48 implements proper benchmarking practices:
```python
def measure_latency(model: nn.Module, sample_input: torch.Tensor,
                    num_runs: int = 100, warmup: int = 10) -> float:
    model.eval()
    with torch.no_grad():
        # Warmup phase: prime caches and stabilize CPU frequency
        for _ in range(warmup):
            _ = model(sample_input)
        # Timed measurement
        start = time.perf_counter()
        for _ in range(num_runs):
            _ = model(sample_input)
        elapsed = time.perf_counter() - start
    # Return average latency in milliseconds
    return (elapsed / num_runs) * 1000.0
```
Key aspects:
Warmup: Runs inference multiple times before measurement to prime CPU caches and stabilize frequency scaling
No gradients: Uses torch.no_grad() to disable gradient computation
High-precision timer: Uses time.perf_counter() for accurate timing
Averaging: Divides total time by number of runs for stable measurements
The measure_latency_distribution function in src/edge_opt/metrics.py:53-56 repeats measurements multiple times:
```python
def measure_latency_distribution(model: nn.Module, sample_input: torch.Tensor,
                                 repeats: int = 5, num_runs: int = 100,
                                 warmup: int = 10) -> tuple[float, float, float]:
    # Measure latency multiple times
    latencies = [
        measure_latency(model, sample_input, num_runs=num_runs, warmup=warmup)
        for _ in range(repeats)
    ]
    # Compute statistics
    latency_tensor = torch.tensor(latencies, dtype=torch.float32)
    return (
        float(latency_tensor.mean()),
        float(latency_tensor.std(unbiased=False)),
        float(torch.quantile(latency_tensor, 0.95)),
    )
```
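As a quick illustration of the statistics computed above (mean, population standard deviation, 95th percentile), here they are applied to a fixed list of latencies; the values are invented for the example:

```python
import torch

latencies = [10.0, 10.5, 9.8, 10.2, 11.0]  # ms, illustrative values
t = torch.tensor(latencies, dtype=torch.float32)

mean = float(t.mean())                  # (10.0+10.5+9.8+10.2+11.0)/5 = 10.3
std = float(t.std(unbiased=False))      # population std, ~0.42
p95 = float(torch.quantile(t, 0.95))    # interpolated 95th percentile, 10.9
```

Note that `unbiased=False` computes the population standard deviation (divides by N rather than N-1), matching the framework's code above.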
From configs/default.yaml, the default is benchmark_repeats: 5. The repeats parameter determines how many independent latency measurements are taken; higher values provide more reliable statistics, but benchmarking time grows linearly with them, so the default of 5 is a reasonable balance.
Comprehensive metric collection function that combines all measurements.
```python
def collect_metrics(
    model: nn.Module,
    loader: DataLoader,
    device: torch.device,
    power_watts: float,
    precision: str,
    latency_multiplier: float = 1.0,
    benchmark_repeats: int = 5,
) -> PerfMetrics:
    """Collect all performance metrics for a model.

    Args:
        model: PyTorch model to evaluate
        loader: DataLoader for evaluation
        device: Device to run on
        power_watts: Device power consumption for energy proxy
        precision: Precision mode ('fp32', 'fp16', 'int8')
        latency_multiplier: Scale factor for latency (e.g., CPU throttling)
        benchmark_repeats: Number of latency measurement repeats

    Returns:
        PerfMetrics dataclass with all metrics
    """
```
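Two of the derived quantities that collect_metrics reports, memory footprint and throughput, can be sketched as below. The exact formulas inside the framework are not shown here, so these helpers are illustrative assumptions: memory as the sum of parameter and buffer bytes, throughput as samples per second implied by the mean per-batch latency.

```python
import torch.nn as nn

def model_memory_mb(model: nn.Module) -> float:
    # Sum parameter and buffer bytes, then convert to MB
    total = sum(p.numel() * p.element_size() for p in model.parameters())
    total += sum(b.numel() * b.element_size() for b in model.buffers())
    return total / (1024 ** 2)

def throughput_sps(latency_ms: float, batch_size: int = 1) -> float:
    # Samples per second implied by the mean per-batch latency
    return batch_size / (latency_ms / 1000.0)

model = nn.Linear(100, 10)   # 100*10 weights + 10 biases = 1010 fp32 params
mem = model_memory_mb(model)          # 1010 * 4 bytes -> ~0.0039 MB
tps = throughput_sps(10.0)            # 10 ms/sample -> 100 samples/s
```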
Number of times to repeat latency measurements for statistical analysis. Higher values provide more reliable statistics but increase benchmarking time. The default of 5 (from configs/default.yaml) is a reasonable starting point; increase to 10-20 if measurements show high variance.
The energy proxy provides an estimate of energy consumption per inference:
```python
# From src/edge_opt/metrics.py:89
energy_proxy = (latency / 1000.0) * power_watts
```
This assumes constant power draw during inference:
Convert latency from ms to seconds: latency / 1000.0
Multiply by power consumption: × power_watts
Result is in joules (J)
Example:
Latency: 10ms = 0.01s
Power: 5W
Energy: 0.01s × 5W = 0.05 J
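The arithmetic above can be wrapped in a small helper (the function name is ours, for illustration):

```python
def energy_proxy_joules(latency_ms: float, power_watts: float) -> float:
    # Convert latency from ms to seconds, then multiply by power (W = J/s)
    return (latency_ms / 1000.0) * power_watts

# 10 ms at 5 W -> 0.05 J, matching the worked example above
energy = energy_proxy_joules(10.0, 5.0)
```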
The energy proxy is a simplified model. Real energy consumption varies with CPU utilization, memory bandwidth, and dynamic power states. For precise measurements, use hardware power monitors.
Use warmup runs: Always include warmup iterations before measurement to prime CPU caches and stabilize frequency scaling. The default of 10 warmup runs is usually sufficient.
Measure on target hardware: Benchmarks on development machines may not reflect edge device performance. Test on the actual deployment hardware when possible.
Avoid background processes: Close unnecessary applications during benchmarking. CPU throttling, background tasks, and thermal throttling can introduce noise in measurements.
Increase repeats for stability: If you see high standard deviation in latency measurements, increase benchmark_repeats to 10-20 for more reliable statistics.