Overview

The hardware profiling module estimates computational cost, memory requirements, and resource utilization for model inference. It provides layer-by-layer operator analysis and quantization tradeoff estimates to guide deployment on resource-constrained hardware.

Core Functions

build_hardware_profile_table

def build_hardware_profile_table(
    feature_count: int, 
    batch_size: int, 
    stream_interval_ms: int
) -> dict:
    """
    Build a comprehensive hardware performance profile.
    
    Args:
        feature_count: Number of input features
        batch_size: Inference batch size
        stream_interval_ms: Time window for streaming data
    
    Returns:
        Dictionary containing operator profiles, totals, precision 
        tradeoffs, and edge deployment constraints
    """
Source: evaluation/hardware_profile.py:18

write_hardware_profile_artifacts

def write_hardware_profile_artifacts(
    profile: dict, 
    output_dir: Path
) -> dict[str, str]:
    """
    Write hardware profile data to CSV files.
    
    Args:
        profile: Profile dictionary from build_hardware_profile_table
        output_dir: Directory to write CSV artifacts
    
    Returns:
        Dictionary mapping artifact names to file paths
    """
Source: evaluation/hardware_profile.py:48

Usage Example

from pathlib import Path
from evaluation.hardware_profile import (
    build_hardware_profile_table,
    write_hardware_profile_artifacts
)

# Define deployment parameters
feature_count = 42  # Number of model features
batch_size = 32     # Inference batch size
stream_interval_ms = 1000  # 1 second streaming window

# Generate hardware profile
profile = build_hardware_profile_table(
    feature_count=feature_count,
    batch_size=batch_size,
    stream_interval_ms=stream_interval_ms
)

# Inspect results
print(f"Total latency: {profile['totals']['latency_ms']:.2f} ms")
print(f"Total memory: {profile['totals']['memory_kb']:.2f} KB")
print(f"Bandwidth: {profile['totals']['estimated_bandwidth_mb_s']:.2f} MB/s")
print(f"Utilization: {profile['totals']['stream_utilization']:.2%}")

# Write to disk
artifacts = write_hardware_profile_artifacts(
    profile=profile,
    output_dir=Path('artifacts/hardware')
)

print(f"Operator profile: {artifacts['operator_profile_csv']}")
print(f"Hardware totals: {artifacts['hardware_totals_csv']}")

Profile Structure

The hardware profile contains four main sections:

1. operator_profile

List of operator-level metrics. Each operator has:
  • operator: Operation name (e.g., input_normalization, linear_projection)
  • latency_ms: Estimated latency in milliseconds
  • memory_kb: Memory footprint in kilobytes
Example:
[
    {"operator": "input_normalization", "latency_ms": 0.32, "memory_kb": 10.5},
    {"operator": "linear_projection", "latency_ms": 0.96, "memory_kb": 31.5},
    {"operator": "activation", "latency_ms": 0.256, "memory_kb": 10.5},
    {"operator": "decision_head", "latency_ms": 0.384, "memory_kb": 0.25}
]

2. totals

Aggregated metrics across all operators:
  • latency_ms: Sum of all operator latencies
  • memory_kb: Sum of all operator memory usage
  • estimated_bandwidth_mb_s: Memory bandwidth estimate
  • stream_utilization: Fraction of stream interval used for compute
Bandwidth calculation (hardware_profile.py:24):
bytes_moved = total_memory_kb * 1024
bandwidth_mb_s = (bytes_moved / (1024 * 1024)) / max(total_latency_ms / 1000, 1e-9)
Utilization calculation (hardware_profile.py:25):
utilization = min(1.0, total_latency_ms / max(stream_interval_ms, 1))
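Plugging the sample totals (52.75 KB of memory moved in 1.92 ms over a 1000 ms stream interval) into these two formulas reproduces the figures reported in the example tables on this page:

```python
# Worked example: apply the bandwidth and utilization formulas above
# to the sample totals used throughout this page.
total_memory_kb = 52.75      # sum of operator memory_kb
total_latency_ms = 1.92      # sum of operator latency_ms
stream_interval_ms = 1000

bytes_moved = total_memory_kb * 1024
bandwidth_mb_s = (bytes_moved / (1024 * 1024)) / max(total_latency_ms / 1000, 1e-9)
utilization = min(1.0, total_latency_ms / max(stream_interval_ms, 1))

print(f"{bandwidth_mb_s:.1f} MB/s")  # 26.8 MB/s
print(f"{utilization:.5f}")          # 0.00192
```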

3. precision_tradeoffs

Memory savings from quantization:
  • fp32_memory_kb: Full precision (baseline)
  • fp16_memory_kb: Half precision (50% reduction)
  • int8_memory_kb: 8-bit integer (75% reduction)
  • note: Warning about deployment-specific latency effects
Example:
{
    "fp32_memory_kb": 52.75,
    "fp16_memory_kb": 26.375,
    "int8_memory_kb": 13.1875,
    "note": "Latency effects from quantization are deployment dependent..."
}
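These figures follow directly from the fp32 baseline: fp16 stores half the bytes and int8 a quarter. The arithmetic can be sketched as follows (a hypothetical helper for illustration, not the module's actual code):

```python
def precision_tradeoffs(fp32_memory_kb: float) -> dict:
    # FP16 halves storage relative to FP32; INT8 quarters it.
    # Illustrative sketch only, not the module's implementation.
    return {
        "fp32_memory_kb": fp32_memory_kb,
        "fp16_memory_kb": fp32_memory_kb * 0.5,
        "int8_memory_kb": fp32_memory_kb * 0.25,
    }

print(precision_tradeoffs(52.75))
# {'fp32_memory_kb': 52.75, 'fp16_memory_kb': 26.375, 'int8_memory_kb': 13.1875}
```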

4. edge_constraints

Deployment considerations:
  • cache_sensitivity: Impact of small batch sizes on cache efficiency
  • bottleneck: Description of primary performance bottleneck

Operator Cost Estimation

Layer costs are estimated using empirical formulas (hardware_profile.py:8-15):
def _estimate_layer_costs(feature_count: int, batch_size: int) -> pd.DataFrame:
    rows = [
        {"operator": "input_normalization", 
         "latency_ms": 0.01 * batch_size, 
         "memory_kb": feature_count * batch_size * 8 / 1024},
        {"operator": "linear_projection", 
         "latency_ms": 0.03 * batch_size, 
         "memory_kb": feature_count * batch_size * 24 / 1024},
        {"operator": "activation", 
         "latency_ms": 0.008 * batch_size, 
         "memory_kb": feature_count * batch_size * 8 / 1024},
        {"operator": "decision_head", 
         "latency_ms": 0.012 * batch_size, 
         "memory_kb": batch_size * 8 / 1024},
    ]
    return pd.DataFrame(rows)
Note: These are estimates. Actual performance varies by hardware, compiler optimizations, and model architecture.

Hardware Utilities

The utils/hardware.py module provides helper functions:

HardwareProfile Dataclass

@dataclass
class HardwareProfile:
    memory_limit_mb: int
    compute_budget: int
Source: utils/hardware.py:6

estimate_batch_memory_mb

def estimate_batch_memory_mb(
    batch_size: int, 
    feature_count: int, 
    bytes_per_feature: int = 8
) -> float:
    return (batch_size * feature_count * bytes_per_feature) / (1024 * 1024)
Source: utils/hardware.py:12
Estimates memory usage for a batch in megabytes. The default assumes 8 bytes per feature (fp64).
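For example, the deployment used throughout this page (batch size 32, 42 features) needs only about 0.01 MB per batch (the function is inlined here so the snippet runs standalone):

```python
def estimate_batch_memory_mb(batch_size: int, feature_count: int,
                             bytes_per_feature: int = 8) -> float:
    # batch * features * bytes per feature, converted to megabytes.
    return (batch_size * feature_count * bytes_per_feature) / (1024 * 1024)

# The example deployment from this page: batch 32, 42 features, fp64.
print(estimate_batch_memory_mb(32, 42))  # ~0.0103 MB
```

Memory scales linearly in batch size, which is why halving the batch (see auto_adjust_batch_size below) halves the footprint.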

auto_adjust_batch_size

def auto_adjust_batch_size(
    initial_batch: int, 
    feature_count: int, 
    profile: HardwareProfile
) -> int:
    batch = initial_batch
    while batch > 1 and estimate_batch_memory_mb(batch, feature_count) > profile.memory_limit_mb:
        batch //= 2
    return max(1, batch)
Source: utils/hardware.py:16
Reduces the batch size by repeated halving until the estimated batch memory fits within profile.memory_limit_mb, never going below 1. Usage example:
from utils.hardware import HardwareProfile, auto_adjust_batch_size

profile = HardwareProfile(memory_limit_mb=128, compute_budget=1000)
adjusted_batch = auto_adjust_batch_size(
    initial_batch=64,
    feature_count=42,
    profile=profile
)
print(f"Adjusted batch size: {adjusted_batch}")

compute_utilization

def compute_utilization(
    operations: int, 
    profile: HardwareProfile
) -> float:
    return min(1.0, operations / max(profile.compute_budget, 1))
Source: utils/hardware.py:23
Calculates compute utilization as a fraction (0.0 to 1.0), clamped at 1.0.
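Usage example (the dataclass and function are inlined here so the snippet is self-contained; in the project they come from utils.hardware):

```python
from dataclasses import dataclass

# Stand-ins for the utils.hardware definitions shown above.
@dataclass
class HardwareProfile:
    memory_limit_mb: int
    compute_budget: int

def compute_utilization(operations: int, profile: HardwareProfile) -> float:
    # Fraction of the compute budget consumed, capped at 1.0.
    return min(1.0, operations / max(profile.compute_budget, 1))

profile = HardwareProfile(memory_limit_mb=128, compute_budget=1000)
print(compute_utilization(750, profile))   # 0.75
print(compute_utilization(2500, profile))  # capped at 1.0
```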

CSV Artifacts

The write_hardware_profile_artifacts function generates two CSV files:

operator_profile.csv

operator,latency_ms,memory_kb
input_normalization,0.32,10.5
linear_projection,0.96,31.5
activation,0.256,10.5
decision_head,0.384,0.25

hardware_totals.csv

latency_ms,memory_kb,estimated_bandwidth_mb_s,stream_utilization
1.92,52.75,26.8,0.00192
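Both artifacts are plain CSV, so they load cleanly with pandas for further analysis, e.g. ranking operators by latency to find the dominant cost. The operator profile is inlined below so the sketch runs standalone; in practice you would pass the paths returned by write_hardware_profile_artifacts to pd.read_csv:

```python
import io
import pandas as pd

# Inlined copy of operator_profile.csv; in practice, read the file
# path returned by write_hardware_profile_artifacts instead.
operator_csv = io.StringIO(
    "operator,latency_ms,memory_kb\n"
    "input_normalization,0.32,10.5\n"
    "linear_projection,0.96,31.5\n"
    "activation,0.256,10.5\n"
    "decision_head,0.384,0.25\n"
)
ops = pd.read_csv(operator_csv)

# Rank operators by latency to spot the dominant cost.
ranked = ops.sort_values("latency_ms", ascending=False)
print(ranked.iloc[0]["operator"])  # linear_projection
print(ops["memory_kb"].sum())     # 52.75
```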

Optimization Workflow

  1. Profile baseline with production batch size and feature count
  2. Identify bottleneck from edge_constraints
  3. Evaluate quantization using precision_tradeoffs
  4. Adjust batch size with auto_adjust_batch_size
  5. Monitor utilization to avoid over/under-provisioning
Putting these steps together:
from utils.hardware import HardwareProfile, auto_adjust_batch_size
from evaluation.hardware_profile import build_hardware_profile_table

# Define hardware constraints
hw_profile = HardwareProfile(memory_limit_mb=256, compute_budget=5000)

# Optimize batch size
optimal_batch = auto_adjust_batch_size(
    initial_batch=128,
    feature_count=42,
    profile=hw_profile
)

# Profile with optimized batch
profile = build_hardware_profile_table(
    feature_count=42,
    batch_size=optimal_batch,
    stream_interval_ms=1000
)

if profile['totals']['stream_utilization'] < 0.5:
    print("⚠ Underutilized - consider increasing batch size")
elif profile['totals']['stream_utilization'] > 0.9:
    print("⚠ Overutilized - reduce batch size or increase stream interval")

Quantization Recommendations

Based on memory constraints:
profile = build_hardware_profile_table(42, 32, 1000)
fp32_mem = profile['precision_tradeoffs']['fp32_memory_kb']
fp16_mem = profile['precision_tradeoffs']['fp16_memory_kb']
int8_mem = profile['precision_tradeoffs']['int8_memory_kb']

if fp32_mem > 100:  # KB
    print(f"Consider FP16 quantization: {fp32_mem:.1f} KB → {fp16_mem:.1f} KB")
if fp32_mem > 200:
    print(f"Consider INT8 quantization: {fp32_mem:.1f} KB → {int8_mem:.1f} KB")

Edge Deployment

For edge devices (Raspberry Pi, mobile, IoT):
  • Small batches: Process 1-4 samples at a time
  • FP16 or INT8: Reduce memory footprint
  • Monitor cache: Small batches reduce cache reuse
  • Bandwidth awareness: Memory movement often dominates latency
# Edge device profile
edge_profile = build_hardware_profile_table(
    feature_count=42,
    batch_size=1,  # Process one sample at a time
    stream_interval_ms=100  # 10 Hz
)

print(f"Per-sample latency: {edge_profile['totals']['latency_ms']:.3f} ms")
print(f"Memory: {edge_profile['precision_tradeoffs']['int8_memory_kb']:.2f} KB (INT8)")
