Overview
The inference pipeline allows you to load trained model checkpoints and run predictions on new data. It supports multiple precision modes and generates detailed performance reports.
Loading Checkpoints
Checkpoints are stored in `.npz` format (NumPy compressed archives). Load them using the `load_weights` method:
```python
from student import NeuralNetwork
from config import PrecisionConfig

# Initialize model with the same architecture used during training
layer_sizes = [784, 64, 10]
activations = ["relu", "softmax"]
cfg = PrecisionConfig(train_dtype="float32", infer_precision="float32", seed=42)

model = NeuralNetwork(
    layer_sizes=layer_sizes,
    activations=activations,
    precision_config=cfg,
)

# Load weights from checkpoint
model.load_weights("path/to/checkpoint.npz")
```
Ensure the model architecture (layer sizes and activations) matches the checkpoint you’re loading.
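One way to catch a mismatch before calling `load_weights` is to inspect the weight shapes stored in the `.npz` archive. The sketch below assumes a hypothetical key layout of one weight matrix `W{i}` and bias `b{i}` per layer; the actual key names depend on how this project's checkpoints are written, so adapt accordingly:

```python
import numpy as np
import os, tempfile

def checkpoint_layer_sizes(path):
    """Infer [input, hidden..., output] layer sizes from weight shapes in an .npz file.

    Assumes hypothetical keys W0, W1, ... for per-layer weight matrices;
    real checkpoints may use different names.
    """
    with np.load(path) as ckpt:
        weights = [ckpt[k] for k in sorted(k for k in ckpt.files if k.startswith("W"))]
    return [weights[0].shape[0]] + [W.shape[1] for W in weights]

# Build a toy checkpoint matching the [784, 64, 10] architecture above
path = os.path.join(tempfile.mkdtemp(), "checkpoint.npz")
np.savez_compressed(
    path,
    W0=np.zeros((784, 64), dtype=np.float32), b0=np.zeros(64, dtype=np.float32),
    W1=np.zeros((64, 10), dtype=np.float32), b1=np.zeros(10, dtype=np.float32),
)

# Compare against the architecture the model was built with
assert checkpoint_layer_sizes(path) == [784, 64, 10], "architecture mismatch"
```

Failing fast here gives a clearer error than a shape mismatch raised from deep inside a forward pass.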
Running Inference
Use the `run_inference` function from `deployment.py` for basic inference:
```python
import numpy as np
from deployment import run_inference

# Prepare input data
X = np.random.randn(64, 784).astype(np.float32)

# Run inference
predictions = run_inference(model, X, precision="float32")
```
Precision Modes
Supported precision modes:

- `float32`: Full precision (default)
- `float16`: Half precision for faster inference
- `int8`: Quantized inference (experimental)
```python
# Use half precision for faster inference
predictions = run_inference(model, X, precision="float16")
```
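To see what half precision costs in accuracy, you can compare a single-layer forward pass in `float32` against the same computation cast to `float16`. This is an illustrative NumPy sketch with toy dimensions, not the casting strategy `run_inference` actually uses internally:

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.standard_normal((32, 16)).astype(np.float32)
x = rng.standard_normal((8, 32)).astype(np.float32)

# Full-precision reference matmul
y32 = x @ W

# Same computation with inputs and weights cast to half precision
y16 = (x.astype(np.float16) @ W.astype(np.float16)).astype(np.float32)

# Half precision trades a small accuracy loss for speed and memory savings
max_err = float(np.abs(y32 - y16).max())
```

For well-scaled activations the discrepancy is typically small; it grows with layer width, since each output element accumulates more rounded terms.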
The `inference.py` script provides a command-line interface:
```shell
python inference.py \
  --weights checkpoints/model.npz \
  --precision float32 \
  --batch-size 64
```
CLI Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--weights` | string | required | Path to `.npz` checkpoint file |
| `--precision` | string | `float32` | Precision mode: `float32`, `float16`, or `int8` |
| `--batch-size` | int | `64` | Number of samples per inference batch |
| `--export-onnx` | flag | `false` | Export model to ONNX format after inference |
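The table above maps naturally onto Python's standard `argparse` module. The following is a minimal sketch of how such a parser could be wired up; `inference.py`'s actual parser may differ in details:

```python
import argparse

def build_parser():
    """Hypothetical parser matching the CLI arguments documented above."""
    p = argparse.ArgumentParser(description="Run inference from an .npz checkpoint")
    p.add_argument("--weights", type=str, required=True,
                   help="Path to .npz checkpoint file")
    p.add_argument("--precision", type=str, default="float32",
                   choices=["float32", "float16", "int8"],
                   help="Precision mode")
    p.add_argument("--batch-size", type=int, default=64,
                   help="Number of samples per inference batch")
    p.add_argument("--export-onnx", action="store_true",
                   help="Export model to ONNX format after inference")
    return p

# Only --weights is required; everything else falls back to its default
args = build_parser().parse_args(["--weights", "checkpoints/model.npz"])
```

Using `choices` for `--precision` makes the CLI reject unsupported modes with a clear error instead of failing later inside the pipeline.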
Example Output
```json
{
  "precision": "float32",
  "latency_per_sample_s": 0.00012345,
  "throughput_samples_per_s": 8100.5
}
```
Generate detailed inference reports using `inference_report`:
```python
from deployment import inference_report

report = inference_report(model, X, precision="float32")
print(report)
```
The report includes:
- **Latency per sample**: Average time per sample (seconds)
- **Throughput**: Samples processed per second
- **Precision mode**: Active precision configuration
Latency Measurement
Latency is measured over multiple runs (default: 10) and averaged:
```python
from deployment import measure_inference_latency

latency = measure_inference_latency(
    model,
    X,
    precision="float32",
    runs=10,
)
print(f"Average latency: {latency:.8f} s per sample")
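The core of a multi-run latency measurement is a timing loop around the forward pass. Here is a self-contained sketch of that approach with a stand-in forward function; the names `measure_latency` and the warm-up detail are assumptions, and `measure_inference_latency`'s internals may differ:

```python
import time

def measure_latency(forward, X, runs=10):
    """Average per-sample latency: total wall time over `runs` repeats,
    divided by (runs * batch size). Illustrative sketch only."""
    forward(X)  # warm-up call, excluded from the timed runs
    start = time.perf_counter()
    for _ in range(runs):
        forward(X)
    elapsed = time.perf_counter() - start
    return elapsed / (runs * len(X))

# Stand-in forward pass over a 64-sample batch
latency = measure_latency(lambda batch: [v * 2.0 for v in batch], [0.0] * 64)
```

`time.perf_counter()` is the right clock here: it is monotonic and has the highest available resolution, unlike `time.time()`.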
Throughput Measurement
Throughput calculates samples per second:
```python
from deployment import batch_inference_throughput

throughput = batch_inference_throughput(
    model,
    X,
    precision="float32",
    runs=10,
)
print(f"Throughput: {throughput:.3f} samples/s")
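Throughput is essentially the reciprocal of per-sample latency: total samples processed divided by total wall time. A self-contained sketch of that calculation, with an assumed helper name and a stand-in forward function (`batch_inference_throughput`'s internals may differ):

```python
import time

def measure_throughput(forward, X, runs=10):
    """Samples per second: (runs * batch size) / total wall time.
    Illustrative sketch only."""
    forward(X)  # warm-up call, excluded from the timed runs
    start = time.perf_counter()
    for _ in range(runs):
        forward(X)
    elapsed = time.perf_counter() - start
    return (runs * len(X)) / elapsed

throughput = measure_throughput(lambda batch: [v * 2.0 for v in batch], [0.0] * 64)
```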
Saving Checkpoints
Export model weights to NumPy checkpoint format:
```python
from deployment import export_numpy_checkpoint

checkpoint_path = export_numpy_checkpoint(
    model,
    "exports/model_checkpoint.npz",
)
print(f"Checkpoint saved to: {checkpoint_path}")
```
The checkpoint contains all layer weights and biases in compressed format.
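You can verify what an exported archive contains with `np.load`, which lists the stored array names and shapes. The `W{i}`/`b{i}` key layout below is a hypothetical example; the actual names depend on how `export_numpy_checkpoint` writes the file:

```python
import numpy as np
import os, tempfile

# Create a toy compressed checkpoint with an assumed W{i}/b{i} key layout
path = os.path.join(tempfile.mkdtemp(), "model_checkpoint.npz")
np.savez_compressed(
    path,
    W0=np.zeros((784, 64), dtype=np.float32), b0=np.zeros(64, dtype=np.float32),
    W1=np.zeros((64, 10), dtype=np.float32), b1=np.zeros(10, dtype=np.float32),
)

# Inspect the stored arrays without loading them into a model
with np.load(path) as ckpt:
    shapes = {name: ckpt[name].shape for name in sorted(ckpt.files)}
```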
Best Practices
**Reproducibility**: Set a global seed before inference for deterministic results:

```python
from reproducibility import set_global_seed

set_global_seed(42)
```
**Batch Size Optimization**: Larger batch sizes improve throughput but increase memory usage. Test different sizes to find the optimal balance.
**Architecture Validation**: Always validate that the model architecture matches the checkpoint before loading weights to avoid shape mismatches.
Next Steps
- **ONNX Export**: Export models to ONNX format for cross-framework deployment
- **PyTorch Comparison**: Compare performance against PyTorch implementations