## Overview

The comparison framework benchmarks your scratch NumPy implementation against an equivalent PyTorch model. This helps validate correctness and understand performance characteristics across frameworks.
## PyTorch Model Equivalent

The `pytorch_model.py` module provides a PyTorch implementation with an identical architecture:

```python
from pytorch_model import TorchNeuralNetwork, is_torch_available

if is_torch_available():
    model = TorchNeuralNetwork(
        layer_sizes=[784, 64, 10],
        activations=["relu", "softmax"],
        seed=42,
    )
```
## Architecture Compatibility

The PyTorch model mirrors the NumPy implementation:

- **Layer structure**: sequential fully connected layers
- **Activations**: ReLU, Sigmoid, Softmax, Linear
- **Weight initialization**: consistent seeding for reproducibility
- **Training loop**: SGD optimizer with mini-batches
## Running Comparisons

Use the `compare.py` script to benchmark both implementations:

```python
from compare import benchmark_scratch_vs_torch

results = benchmark_scratch_vs_torch(
    layer_sizes=[784, 64, 10],
    activations=["relu", "softmax"],
    n_samples=512,
    epochs=3,
    batch_size=32,
    alpha=0.1,
    seed=42,
)
```
Running the comparison from the command line executes a full benchmark and generates comparison reports.
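A typical invocation, assuming `compare.py` is runnable as a script (the exact entry point may differ in your setup):

```shell
python compare.py
```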
## Metrics Collected

The comparison framework measures:

| Metric | Description | Unit |
|--------|-------------|------|
| `train_time_per_epoch_s` | Average training time per epoch | seconds |
| `inference_latency_per_sample_s` | Average inference time per sample | seconds |
| `batch_throughput_samples_per_s` | Samples processed per second | samples/s |
| `peak_memory_mb` | Peak memory usage during training | MB |
| `final_accuracy` | Model accuracy after training | 0-1 |
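Timing metrics like these reduce to a few arithmetic steps over `time.perf_counter` measurements. A minimal sketch (the helper name and structure here are illustrative, not the framework's actual internals):

```python
import time

def timing_metrics(train_fn, n_epochs, n_samples):
    """Derive per-epoch time and throughput from one timed training run."""
    t0 = time.perf_counter()
    train_fn()
    elapsed = time.perf_counter() - t0
    return {
        "train_time_per_epoch_s": elapsed / n_epochs,
        # total samples processed divided by wall-clock time
        "batch_throughput_samples_per_s": (n_epochs * n_samples) / elapsed,
    }

# Stand-in for a real training call:
metrics = timing_metrics(lambda: time.sleep(0.01), n_epochs=2, n_samples=100)
```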
## Example Results

```json
[
  {
    "framework": "scratch_numpy",
    "status": "ok",
    "layer_sizes": "784x64x10",
    "train_time_per_epoch_s": 0.125,
    "inference_latency_per_sample_s": 0.000123,
    "batch_throughput_samples_per_s": 8130.5,
    "peak_memory_mb": 45.2,
    "final_accuracy": 0.89
  },
  {
    "framework": "pytorch",
    "status": "ok",
    "layer_sizes": "784x64x10",
    "train_time_per_epoch_s": 0.098,
    "inference_latency_per_sample_s": 0.000087,
    "batch_throughput_samples_per_s": 11494.3,
    "peak_memory_mb": 128.5,
    "final_accuracy": 0.90
  }
]
```
Comparison results are saved in multiple formats:

### JSON Report

```json
// benchmarks/comparison/comparison_metrics.json
[
  {
    "framework": "scratch_numpy",
    "status": "ok",
    "train_time_per_epoch_s": 0.125,
    ...
  }
]
```
### CSV Export

```csv
framework,status,train_time_per_epoch_s,inference_latency_per_sample_s,...
scratch_numpy,ok,0.125,0.000123,...
pytorch,ok,0.098,0.000087,...
```
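A CSV like this can be produced directly from the result dictionaries with the standard library's `csv.DictWriter`; a minimal sketch with abbreviated fields:

```python
import csv
import io

# Illustrative result rows; real runs produce the full metric set.
results = [
    {"framework": "scratch_numpy", "status": "ok", "train_time_per_epoch_s": 0.125},
    {"framework": "pytorch", "status": "ok", "train_time_per_epoch_s": 0.098},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(results[0].keys()))
writer.writeheader()       # first row: column names
writer.writerows(results)  # one row per framework
csv_text = buf.getvalue()
```

In the real framework the buffer would be a file handle pointing at the export path.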
### Visual Comparison

A comparison plot is generated at `benchmarks/comparison/comparison_summary.png` showing:

- Training time comparison
- Inference latency comparison
- Memory usage comparison
- Final accuracy comparison
## Training Comparison

Both models are trained with identical settings:

```python
# NumPy implementation
from student import NeuralNetwork

scratch_model = NeuralNetwork(
    layer_sizes=layer_sizes,
    activations=activations,
    precision_config=cfg,
)
history = scratch_model.fit(
    X, y,
    epochs=3,
    alpha=0.1,
    batch_size=32,
    seed=42,
)
```

```python
# PyTorch implementation
from pytorch_model import TorchNeuralNetwork

torch_model = TorchNeuralNetwork(
    layer_sizes=layer_sizes,
    activations=activations,
    seed=42,
)
history = torch_model.fit(
    X, y,
    epochs=3,
    alpha=0.1,
    batch_size=32,
    seed=42,
)
```
### Training Parameters

- **Optimizer**: stochastic gradient descent (SGD)
- **Loss function**: mean squared error (MSE)
- **Batch processing**: mini-batch gradient descent
- **Shuffling**: random shuffle each epoch
- **Seed**: fixed for reproducibility
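The shared recipe above boils down to a shuffle-batch-update loop. A minimal NumPy sketch for a single linear layer trained with MSE, to make the structure concrete (this is not the actual `fit` implementation):

```python
import numpy as np

def sgd_fit(X, y, epochs=3, alpha=0.1, batch_size=32, seed=42):
    rng = np.random.default_rng(seed)            # fixed seed for reproducibility
    W = rng.normal(0, 0.01, (X.shape[1], y.shape[1]))
    for _ in range(epochs):
        order = rng.permutation(len(X))          # random shuffle each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            err = Xb @ W - yb                    # residuals for this mini-batch
            grad = 2 * Xb.T @ err / len(Xb)      # MSE gradient
            W -= alpha * grad                    # SGD update
    return W

# Fit on synthetic linear data to sanity-check convergence.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
true_W = rng.normal(size=(4, 1))
W = sgd_fit(X, X @ true_W, epochs=20, alpha=0.05)
```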
## Inference Comparison

Inference benchmarks measure both implementations:

### NumPy Inference

```python
from benchmark import measure_inference_latency_per_sample

latency = measure_inference_latency_per_sample(
    scratch_model,
    X,
    precision="float32",
)
```
PyTorch Inference
def _measure_torch_inference_latency_per_sample ( model , X , precision = "float32" , runs = 5 ):
times = []
for _ in range (runs):
t0 = time.perf_counter()
model.forward(X, training = False , precision = precision)
times.append(time.perf_counter() - t0)
return np.mean(times) / X.shape[ 0 ]
Inference benchmarks include a warmup run to avoid cold-start overhead.
## Memory Profiling

Peak memory usage is measured using `tracemalloc`:

```python
from benchmark import _measure_peak_memory_mb

result, peak_memory = _measure_peak_memory_mb(
    lambda: model.fit(X, y, epochs=3, alpha=0.1, batch_size=32, seed=42)
)
print(f"Peak memory: {peak_memory:.2f} MB")
```
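A helper of this shape can be built on the standard library's `tracemalloc`; a minimal sketch, not necessarily the `benchmark` module's exact code:

```python
import tracemalloc

def measure_peak_memory_mb(fn):
    """Run fn while tracing allocations; return (result, peak usage in MB)."""
    tracemalloc.start()
    try:
        result = fn()
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # always stop tracing, even if fn raises
    return result, peak_bytes / (1024 * 1024)

# Allocate ~8 MB to exercise the helper.
result, peak_mb = measure_peak_memory_mb(lambda: [0] * 1_000_000)
```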
### Memory Considerations

- **NumPy**: lower memory footprint, no framework overhead
- **PyTorch**: higher memory due to the autograd graph and CUDA context
## Handling Missing PyTorch

The comparison gracefully handles missing PyTorch:

```python
if not is_torch_available():
    results.append({
        "framework": "pytorch",
        "status": "skipped",
        "notes": "torch not installed",
        ...
    })
```

If PyTorch is not installed, the comparison runs only the NumPy implementation and marks the PyTorch entry as "skipped".
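An availability check like `is_torch_available` can be written with `importlib.util.find_spec`, so the check itself never raises `ImportError`; a sketch under that assumption:

```python
import importlib.util

def is_torch_available() -> bool:
    """True if torch is installed, without importing it."""
    return importlib.util.find_spec("torch") is not None

# Degrade gracefully when the dependency is missing.
status = "ok" if is_torch_available() else "skipped"
```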
## Interpreting Results

**Training speed**:

- PyTorch is typically faster thanks to its optimized C++ backend
- The NumPy implementation shows your optimization skills

**Inference speed**:

- PyTorch is optimized for batched operations
- NumPy is competitive for smaller models

**Memory usage**:

- NumPy uses less memory (no framework overhead)
- PyTorch allocates additional memory for autograd

**Accuracy**:

- Both should achieve similar final accuracy
- Small differences arise from numerical precision
Example Interpretation
# NumPy: train_time = 0.125s, memory = 45MB
# PyTorch: train_time = 0.098s, memory = 128MB
# Speedup: 0.125 / 0.098 = 1.28x faster (PyTorch)
# Memory ratio: 128 / 45 = 2.84x more memory (PyTorch)
## Customizing Comparisons

Run custom benchmarks with different configurations:

```python
from compare import benchmark_scratch_vs_torch

# Deep network comparison (5 layer sizes -> 4 layers, so 4 activations)
results = benchmark_scratch_vs_torch(
    layer_sizes=[1024, 512, 256, 128, 10],
    activations=["relu", "relu", "relu", "softmax"],
    n_samples=1000,
    epochs=5,
    batch_size=64,
    alpha=0.01,
    seed=42,
)

# Wide network comparison
results = benchmark_scratch_vs_torch(
    layer_sizes=[784, 256, 256, 10],
    activations=["relu", "relu", "softmax"],
    n_samples=500,
    epochs=10,
    batch_size=128,
    alpha=0.05,
    seed=123,
)
```
## Console Output

The comparison prints a formatted table:

```text
framework     | status | train_time_per_epoch_s | ...
-----------------------------------------------------
scratch_numpy | ok     | 0.125                  | ...
pytorch       | ok     | 0.098                  | ...
```
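A table like this can be rendered with plain fixed-width string formatting; an illustrative sketch, not the script's actual printer:

```python
def format_table(rows, columns):
    """Render a list of dicts as a fixed-width, pipe-separated table."""
    # Each column is as wide as its longest value (or its header).
    widths = [max(len(c), max(len(str(r[c])) for r in rows)) for c in columns]
    header = " | ".join(c.ljust(w) for c, w in zip(columns, widths))
    lines = [header, "-" * len(header)]
    for r in rows:
        lines.append(" | ".join(str(r[c]).ljust(w) for c, w in zip(columns, widths)))
    return "\n".join(lines)

table = format_table(
    [{"framework": "scratch_numpy", "status": "ok"},
     {"framework": "pytorch", "status": "ok"}],
    ["framework", "status"],
)
print(table)
```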
## Best Practices

- **Consistent seeds**: always use the same seed so both frameworks see identical data and initialization.
- **Warmup runs**: benchmarks include warmup iterations to avoid initialization overhead.
- **Multiple runs**: average over multiple runs (default: 5-10) for stable measurements.

Avoid comparing results from different machines or Python versions; performance characteristics can vary significantly.
## Next Steps

- **Inference Guide**: learn about loading checkpoints and running inference
- **ONNX Export**: export models to ONNX for production deployment