Overview

The inference pipeline allows you to load trained model checkpoints and run predictions on new data. It supports multiple precision modes and generates detailed performance reports.

Loading Checkpoints

Checkpoints are stored in .npz format (NumPy compressed archives). Load them using the load_weights method:
from student import NeuralNetwork
from config import PrecisionConfig

# Initialize model with same architecture as training
layer_sizes = [784, 64, 10]
activations = ["relu", "softmax"]
cfg = PrecisionConfig(train_dtype="float32", infer_precision="float32", seed=42)

model = NeuralNetwork(
    layer_sizes=layer_sizes,
    activations=activations,
    precision_config=cfg
)

# Load weights from checkpoint
model.load_weights("path/to/checkpoint.npz")
Ensure the model architecture (layer sizes and activations) matches the checkpoint you’re loading.
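One way to catch a mismatch early is to compare the weight shapes stored in the checkpoint against the expected layer sizes before loading. The sketch below assumes the archive keys are named W0, W1, … per layer; the actual key names in your checkpoints may differ.

```python
import numpy as np

# Hypothetical sketch: verify checkpoint weight shapes against the expected
# architecture before calling load_weights. The key names ("W0", "W1", ...)
# are assumptions and may not match your checkpoint format.
def check_architecture(checkpoint_path, layer_sizes):
    with np.load(checkpoint_path) as ckpt:
        for i, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
            w = ckpt[f"W{i}"]
            if w.shape != (n_in, n_out):
                raise ValueError(
                    f"Layer {i}: checkpoint shape {w.shape} != expected {(n_in, n_out)}"
                )
    return True

# Example: write a matching dummy checkpoint and validate it
np.savez("demo_checkpoint.npz",
         W0=np.zeros((784, 64), dtype=np.float32),
         W1=np.zeros((64, 10), dtype=np.float32))
print(check_architecture("demo_checkpoint.npz", [784, 64, 10]))  # True
```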

Running Inference

Use the run_inference function from deployment.py for basic inference:
import numpy as np
from deployment import run_inference

# Prepare input data
X = np.random.randn(64, 784).astype(np.float32)

# Run inference
predictions = run_inference(model, X, precision="float32")

Precision Modes

Supported precision modes:
  • float32: Full precision (default)
  • float16: Half precision for faster inference
  • int8: Quantized inference (experimental)
For example, to run at half precision:
# Use half precision for faster inference
predictions = run_inference(model, X, precision="float16")
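Half precision trades a small amount of numerical accuracy for speed. The sketch below, independent of the pipeline itself, shows the rounding error introduced by casting float32 activations to float16:

```python
import numpy as np

# Sketch of the numerical effect of half precision: cast activations to
# float16 and compare against the float32 originals.
x32 = np.random.default_rng(0).standard_normal((64, 784)).astype(np.float32)
x16 = x32.astype(np.float16)

max_err = np.max(np.abs(x32 - x16.astype(np.float32)))
print(f"dtype: {x16.dtype}, max casting error: {max_err:.6f}")
```

For well-scaled inputs the error is on the order of 1e-3, which is usually negligible for classification outputs.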

CLI Inference Tool

The inference.py script provides a command-line interface:
python inference.py \
  --weights checkpoints/model.npz \
  --precision float32 \
  --batch-size 64

CLI Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| --weights | string | required | Path to .npz checkpoint file |
| --precision | string | float32 | Precision mode: float32, float16, or int8 |
| --batch-size | int | 64 | Number of samples per inference batch |
| --export-onnx | flag | false | Export model to ONNX format after inference |
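A minimal sketch of how these flags might be parsed with argparse; the actual inference.py may differ in details:

```python
import argparse

# Hypothetical sketch of the CLI surface described in the table above.
parser = argparse.ArgumentParser(description="Run inference from a checkpoint")
parser.add_argument("--weights", required=True,
                    help="Path to .npz checkpoint file")
parser.add_argument("--precision", default="float32",
                    choices=["float32", "float16", "int8"],
                    help="Precision mode")
parser.add_argument("--batch-size", type=int, default=64,
                    help="Number of samples per inference batch")
parser.add_argument("--export-onnx", action="store_true",
                    help="Export model to ONNX format after inference")

args = parser.parse_args(["--weights", "checkpoints/model.npz"])
print(args.precision, args.batch_size)  # float32 64
```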

Example Output

{
  "precision": "float32",
  "latency_per_sample_s": 0.00012345,
  "throughput_samples_per_s": 8100.5
}
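The two performance fields are consistent with each other: throughput is approximately the reciprocal of per-sample latency, so either number can be sanity-checked against the other.

```python
import json

report = json.loads("""
{
  "precision": "float32",
  "latency_per_sample_s": 0.00012345,
  "throughput_samples_per_s": 8100.5
}
""")

# Throughput should be roughly 1 / latency_per_sample.
implied = 1.0 / report["latency_per_sample_s"]
print(round(implied, 1))
```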

Performance Benchmarking

Generate detailed inference reports using inference_report:
from deployment import inference_report

report = inference_report(model, X, precision="float32")
print(report)
The report includes:
  • Latency per sample: Average time per sample (seconds)
  • Throughput: Samples processed per second
  • Precision mode: Active precision configuration

Latency Measurement

Latency is measured over multiple runs (default: 10) and averaged:
from deployment import measure_inference_latency

latency = measure_inference_latency(
    model,
    X,
    precision="float32",
    runs=10
)
print(f"Average latency: {latency:.8f}s per sample")

Throughput Measurement

Throughput is measured as samples processed per second:
from deployment import batch_inference_throughput

throughput = batch_inference_throughput(
    model,
    X,
    precision="float32",
    runs=10
)
print(f"Throughput: {throughput:.3f} samples/s")

Saving Checkpoints

Export model weights to NumPy checkpoint format:
from deployment import export_numpy_checkpoint

checkpoint_path = export_numpy_checkpoint(
    model,
    "exports/model_checkpoint.npz"
)
print(f"Checkpoint saved to: {checkpoint_path}")
The checkpoint contains all layer weights and biases in compressed format.
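Because the checkpoint is a standard .npz archive, it can be inspected directly with np.load. The key names below are illustrative, not the pipeline's actual naming scheme:

```python
import numpy as np

# Write a small illustrative checkpoint (key names are hypothetical) and
# inspect it: an .npz archive maps array names to arrays.
np.savez_compressed("demo_ckpt.npz",
                    W0=np.zeros((784, 64), dtype=np.float32),
                    b0=np.zeros(64, dtype=np.float32))

with np.load("demo_ckpt.npz") as ckpt:
    shapes = {name: ckpt[name].shape for name in ckpt.files}
print(shapes)
```

This is a quick way to confirm what a checkpoint contains before wiring it into a model.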

Best Practices

Reproducibility: Set a global seed before inference for deterministic results:
from reproducibility import set_global_seed
set_global_seed(42)
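Seeding makes any randomness in the pipeline repeatable across runs. A minimal sketch with NumPy directly, assuming set_global_seed wraps something like this:

```python
import numpy as np

# Two draws from identically seeded generators are bit-for-bit equal.
np.random.seed(42)
a = np.random.randn(3)
np.random.seed(42)
b = np.random.randn(3)
print(np.array_equal(a, b))  # True
```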
Batch Size Optimization: Larger batch sizes improve throughput but increase memory usage. Test different sizes to find the optimal balance.
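A batch-size sweep can be sketched against a stand-in workload (a single matmul here); real numbers depend on your model and hardware:

```python
import time
import numpy as np

# Stand-in for a model forward pass: one dense layer.
W = np.random.randn(784, 10).astype(np.float32)

def throughput(batch_size, runs=5):
    X = np.random.randn(batch_size, 784).astype(np.float32)
    start = time.perf_counter()
    for _ in range(runs):
        X @ W
    elapsed = time.perf_counter() - start
    return (runs * batch_size) / elapsed

for bs in (16, 64, 256):
    print(f"batch={bs:4d}  throughput={throughput(bs):.0f} samples/s")
```

Pick the smallest batch size past which throughput stops improving, to avoid paying memory for no speed gain.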
Always validate that the model architecture matches the checkpoint before loading weights to avoid shape mismatches.

Next Steps

ONNX Export

Export models to ONNX format for cross-framework deployment

PyTorch Comparison

Compare performance against PyTorch implementations
