Overview

The inference pipeline allows you to load trained model checkpoints and run predictions on new data. It supports multiple precision modes and generates detailed performance reports.

Loading Checkpoints

Checkpoints are stored in .npz format (NumPy compressed archives). Load them using the load_weights method:
from student import NeuralNetwork
from config import PrecisionConfig

# Initialize model with same architecture as training
layer_sizes = [784, 64, 10]
activations = ["relu", "softmax"]
cfg = PrecisionConfig(train_dtype="float32", infer_precision="float32", seed=42)

model = NeuralNetwork(
    layer_sizes=layer_sizes,
    activations=activations,
    precision_config=cfg
)

# Load weights from checkpoint
model.load_weights("path/to/checkpoint.npz")
Ensure the model architecture (layer sizes and activations) matches the checkpoint you’re loading.
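One way to catch a mismatch early is to compare the weight shapes stored in the checkpoint against the expected layer sizes before loading. The sketch below assumes the archive keys are named W0, W1, … per layer; the actual key names in your checkpoints may differ.

```python
import numpy as np

# Hypothetical sketch: verify checkpoint weight shapes against the expected
# architecture before calling load_weights. The key names ("W0", "W1", ...)
# are assumptions and may not match your checkpoint format.
def check_architecture(checkpoint_path, layer_sizes):
    with np.load(checkpoint_path) as ckpt:
        for i, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
            w = ckpt[f"W{i}"]
            if w.shape != (n_in, n_out):
                raise ValueError(
                    f"Layer {i}: checkpoint shape {w.shape} != expected {(n_in, n_out)}"
                )
    return True

# Example: write a matching dummy checkpoint and validate it
np.savez("demo_checkpoint.npz",
         W0=np.zeros((784, 64), dtype=np.float32),
         W1=np.zeros((64, 10), dtype=np.float32))
print(check_architecture("demo_checkpoint.npz", [784, 64, 10]))  # True
```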

Running Inference

Use the run_inference function from deployment.py for basic inference:
import numpy as np
from deployment import run_inference

# Prepare input data
X = np.random.randn(64, 784).astype(np.float32)

# Run inference
predictions = run_inference(model, X, precision="float32")

Precision Modes

Supported precision modes:
  • float32: Full precision (default)
  • float16: Half precision for faster inference
  • int8: Quantized inference (experimental)
For example, to run at half precision:
# Use half precision for faster inference
predictions = run_inference(model, X, precision="float16")
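Half precision trades a small amount of numerical accuracy for speed. The sketch below, independent of the pipeline itself, shows the rounding error introduced by casting float32 activations to float16:

```python
import numpy as np

# Sketch of the numerical effect of half precision: cast activations to
# float16 and compare against the float32 originals.
x32 = np.random.default_rng(0).standard_normal((64, 784)).astype(np.float32)
x16 = x32.astype(np.float16)

max_err = np.max(np.abs(x32 - x16.astype(np.float32)))
print(f"dtype: {x16.dtype}, max casting error: {max_err:.6f}")
```

For well-scaled inputs the error is on the order of 1e-3, which is usually negligible for classification outputs.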

CLI Inference Tool

The inference.py script provides a command-line interface:
python inference.py \
  --weights checkpoints/model.npz \
  --precision float32 \
  --batch-size 64

CLI Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --weights | string | required | Path to .npz checkpoint file |
| --precision | string | float32 | Precision mode: float32, float16, or int8 |
| --batch-size | int | 64 | Number of samples per inference batch |
| --export-onnx | flag | false | Export model to ONNX format after inference |
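A minimal sketch of how these flags might be parsed with argparse; the actual inference.py may differ in details:

```python
import argparse

# Hypothetical sketch of the CLI surface described in the table above.
parser = argparse.ArgumentParser(description="Run inference from a checkpoint")
parser.add_argument("--weights", required=True,
                    help="Path to .npz checkpoint file")
parser.add_argument("--precision", default="float32",
                    choices=["float32", "float16", "int8"],
                    help="Precision mode")
parser.add_argument("--batch-size", type=int, default=64,
                    help="Number of samples per inference batch")
parser.add_argument("--export-onnx", action="store_true",
                    help="Export model to ONNX format after inference")

args = parser.parse_args(["--weights", "checkpoints/model.npz"])
print(args.precision, args.batch_size)  # float32 64
```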

Example Output

{
  "precision": "float32",
  "latency_per_sample_s": 0.00012345,
  "throughput_samples_per_s": 8100.5
}
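The two performance fields are consistent with each other: throughput is approximately the reciprocal of per-sample latency, so either number can be sanity-checked against the other.

```python
import json

report = json.loads("""
{
  "precision": "float32",
  "latency_per_sample_s": 0.00012345,
  "throughput_samples_per_s": 8100.5
}
""")

# Throughput should be roughly 1 / latency_per_sample.
implied = 1.0 / report["latency_per_sample_s"]
print(round(implied, 1))
```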

Performance Benchmarking

Generate detailed inference reports using inference_report:
from deployment import inference_report

report = inference_report(model, X, precision="float32")
print(report)
The report includes:
  • Latency per sample: Average time per sample (seconds)
  • Throughput: Samples processed per second
  • Precision mode: Active precision configuration

Latency Measurement

Latency is measured over multiple runs (default: 10) and averaged:
from deployment import measure_inference_latency

latency = measure_inference_latency(
    model,
    X,
    precision="float32",
    runs=10
)
print(f"Average latency: {latency:.8f}s per sample")

Throughput Measurement

Throughput is measured as samples processed per second:
from deployment import batch_inference_throughput

throughput = batch_inference_throughput(
    model,
    X,
    precision="float32",
    runs=10
)
print(f"Throughput: {throughput:.3f} samples/s")

Saving Checkpoints

Export model weights to NumPy checkpoint format:
from deployment import export_numpy_checkpoint

checkpoint_path = export_numpy_checkpoint(
    model,
    "exports/model_checkpoint.npz"
)
print(f"Checkpoint saved to: {checkpoint_path}")
The checkpoint contains all layer weights and biases in compressed format.
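Because the checkpoint is a standard .npz archive, it can be inspected directly with np.load. The key names below are illustrative, not the pipeline's actual naming scheme:

```python
import numpy as np

# Write a small illustrative checkpoint (key names are hypothetical) and
# inspect it: an .npz archive maps array names to arrays.
np.savez_compressed("demo_ckpt.npz",
                    W0=np.zeros((784, 64), dtype=np.float32),
                    b0=np.zeros(64, dtype=np.float32))

with np.load("demo_ckpt.npz") as ckpt:
    shapes = {name: ckpt[name].shape for name in ckpt.files}
print(shapes)
```

This is a quick way to confirm what a checkpoint contains before wiring it into a model.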

Best Practices

Reproducibility: Set a global seed before inference for deterministic results:
from reproducibility import set_global_seed
set_global_seed(42)
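Seeding makes any randomness in the pipeline repeatable across runs. A minimal sketch with NumPy directly, assuming set_global_seed wraps something like this:

```python
import numpy as np

# Two draws from identically seeded generators are bit-for-bit equal.
np.random.seed(42)
a = np.random.randn(3)
np.random.seed(42)
b = np.random.randn(3)
print(np.array_equal(a, b))  # True
```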
Batch Size Optimization: Larger batch sizes improve throughput but increase memory usage. Test different sizes to find the optimal balance.
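A batch-size sweep can be sketched against a stand-in workload (a single matmul here); real numbers depend on your model and hardware:

```python
import time
import numpy as np

# Stand-in for a model forward pass: one dense layer.
W = np.random.randn(784, 10).astype(np.float32)

def throughput(batch_size, runs=5):
    X = np.random.randn(batch_size, 784).astype(np.float32)
    start = time.perf_counter()
    for _ in range(runs):
        X @ W
    elapsed = time.perf_counter() - start
    return (runs * batch_size) / elapsed

for bs in (16, 64, 256):
    print(f"batch={bs:4d}  throughput={throughput(bs):.0f} samples/s")
```

Pick the smallest batch size past which throughput stops improving, to avoid paying memory for no speed gain.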
Always validate that the model architecture matches the checkpoint before loading weights to avoid shape mismatches.

Next Steps

ONNX Export

Export models to ONNX format for cross-framework deployment

PyTorch Comparison

Compare performance against PyTorch implementations
