
Overview

The model supports three numeric precision modes for inference: float32 (full precision), float16 (half precision), and int8 (quantized). These modes trade numerical accuracy for reduced memory footprint and potentially faster computation.
Precision modes are simulation-based in this implementation. They approximate the behavior of native low-precision kernels but do not provide actual hardware acceleration.

Precision Configuration

Precision is controlled through the PrecisionConfig dataclass:
config.py:6-11
@dataclass
class PrecisionConfig:
    train_dtype: str = "float32"
    infer_precision: str = "float32"  # float32 | float16 | int8
    int8_clip_value: int = 127
    seed: int = 42
  • train_dtype (string, default "float32"): Data type used during training (always float32 in this implementation)
  • infer_precision (string, default "float32"): Precision mode for inference: float32, float16, or int8
  • int8_clip_value (int, default 127): Maximum absolute value for int8 quantization (typically 127 for symmetric quantization)

Float32: Full Precision

IEEE 754 single-precision floating point
  • Storage: 4 bytes per value
  • Range: ±1.4e-45 to ±3.4e38
  • Precision: ~7 decimal digits
  • Use case: Training and high-accuracy inference
Float32 is the default mode and is used for all training. The forward pass operates entirely in float32, dispatching to a reduced-precision path only at inference time when float16 or int8 is requested:
model.py:87-91
def forward(self, x, training=False, precision=None):
    selected_precision = self.infer_precision if precision is None else precision

    if not training and selected_precision in {"float16", "int8"}:
        return self._forward_with_precision(x.astype(np.float32), selected_precision)

Float16: Half Precision

IEEE 754 half-precision floating point
  • Storage: 2 bytes per value
  • Range: ±6e-8 to ±65,504
  • Precision: ~3 decimal digits
  • Use case: Memory-constrained inference
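These limits are easy to verify directly in NumPy; a quick illustration (plain NumPy, independent of the project code):

```python
import numpy as np

# Differences smaller than float16's machine epsilon (~0.000977)
# vanish near 1.0: only ~3 decimal digits survive the cast.
print(np.float16(np.float32(1.0001)) == np.float16(1.0))  # True

# Values beyond the maximum finite float16 (65504) overflow to inf.
print(np.float16(70000.0))  # inf

# np.finfo reports the limits quoted above.
print(np.finfo(np.float16).eps, np.finfo(np.float16).max)
```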

Float16 Implementation

Float16 inference is implemented by casting weights and activations:
model.py:82-84
else:
    dtype = np.float16 if precision == "float16" else np.float32
    z = layer.forward(current.astype(dtype), weights=layer.weights.astype(dtype), bias=layer.bias.astype(dtype))
    current = activation_forward(z, activation_name).astype(dtype)
  1. Cast inputs: Convert the layer input to float16
  2. Cast parameters: Convert weights and biases to float16
  3. Compute in float16: Matrix multiply and activation in half precision
  4. Propagate float16: Output remains in float16 for the next layer
This is a simulation that converts float32 to float16 in software. On real hardware with native float16 support (e.g., NVIDIA Tensor Cores), the speedup would be much more significant.

Float16 Accuracy Considerations

For the Fashion-MNIST task, float16 typically shows minimal accuracy degradation:
  • Float32 test accuracy: ~88-90%
  • Float16 test accuracy: ~88-89% (0-1% drop)
  • Memory: 50% of float32

Int8: Quantized Precision

8-bit signed integer with dynamic quantization
  • Storage: 1 byte per value
  • Range: -128 to 127 (before scaling)
  • Precision: Integer values only
  • Use case: Extreme memory constraints, edge deployment

Int8 Quantization Scheme

This implementation uses dynamic symmetric quantization:
model.py:55-65
def _quantize_to_int8(self, x):
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros_like(x, dtype=np.int8), 1.0
    scale = max_abs / float(self.int8_clip_value)
    q = np.clip(np.round(x / scale), -self.int8_clip_value, self.int8_clip_value).astype(np.int8)
    return q, scale

@staticmethod
def _dequantize_from_int8(q, scale):
    return q.astype(np.float32) * np.float32(scale)
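To see the round trip in action, the two methods above can be exercised as standalone functions (the names and the CLIP constant are illustrative stand-ins for the class attributes):

```python
import numpy as np

CLIP = 127  # mirrors int8_clip_value

def quantize_to_int8(x):
    # Per-tensor symmetric quantization, as in _quantize_to_int8.
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros_like(x, dtype=np.int8), 1.0
    scale = max_abs / float(CLIP)
    q = np.clip(np.round(x / scale), -CLIP, CLIP).astype(np.int8)
    return q, scale

def dequantize_from_int8(q, scale):
    return q.astype(np.float32) * np.float32(scale)

x = np.array([0.5, -2.0, 1.0], dtype=np.float32)
q, scale = quantize_to_int8(x)
x_approx = dequantize_from_int8(q, scale)
print(q)                              # [  32 -127   64]
print(np.max(np.abs(x - x_approx)))  # bounded by scale / 2
```

Note that the element with the largest magnitude (-2.0) round-trips almost exactly, while every other element carries up to scale/2 of rounding error.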

Quantization Algorithm

  1. Find max absolute value: max_abs = max(|x|) across the entire tensor
  2. Compute scale factor: scale = max_abs / 127 (maps the largest magnitude to ±127)
  3. Scale and round: q = round(x / scale) to get integer values
  4. Clip to range: q = clip(q, -127, 127) to ensure valid int8
  5. Dequantize for use: x_approx = q * scale to convert back to float
This is a per-tensor quantization scheme. More sophisticated approaches use per-channel quantization or learned quantization parameters for better accuracy.
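To illustrate why per-channel quantization helps, here is a small sketch comparing quantize-dequantize error under a shared scale versus one scale per output column (hypothetical helper functions, not part of the project; assumes no all-zero columns):

```python
import numpy as np

CLIP = 127

def roundtrip_per_tensor(x):
    # One scale for the whole tensor: quantize, then dequantize.
    scale = np.max(np.abs(x)) / CLIP
    q = np.clip(np.round(x / scale), -CLIP, CLIP).astype(np.int8)
    return q.astype(np.float32) * scale

def roundtrip_per_channel(w):
    # One scale per output column instead of one for the whole tensor.
    scales = np.max(np.abs(w), axis=0, keepdims=True) / CLIP
    q = np.clip(np.round(w / scales), -CLIP, CLIP).astype(np.int8)
    return q.astype(np.float32) * scales

# Column 0 holds small values, column 1 large ones: a shared scale
# wipes out most of column 0's resolution.
w = np.array([[ 0.010,  5.0],
              [ 0.020, -4.0],
              [-0.015,  3.0]], dtype=np.float32)

err_tensor = np.mean(np.abs(w - roundtrip_per_tensor(w)))
err_channel = np.mean(np.abs(w - roundtrip_per_channel(w)))
print(err_tensor, err_channel)  # per-channel error is much smaller here
```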

Int8 Forward Pass

Each layer operation involves quantization, computation, and dequantization:
model.py:70-80
if precision == "int8":
    q_a, a_scale = self._quantize_to_int8(current)
    q_w, w_scale = self._quantize_to_int8(layer.weights)
    a_deq = self._dequantize_from_int8(q_a, a_scale)
    w_deq = self._dequantize_from_int8(q_w, w_scale)
    z = a_deq @ w_deq + layer.bias.astype(np.float32)
    z_q, z_scale = self._quantize_to_int8(z)
    z = self._dequantize_from_int8(z_q, z_scale)
    a = activation_forward(z, activation_name)
    a_q, a_scale = self._quantize_to_int8(a)
    current = self._dequantize_from_int8(a_q, a_scale)
Quantization points:
  1. Quantize activations from previous layer
  2. Quantize current layer weights
  3. Dequantize both for matrix multiply
  4. Compute z = a @ w + b in float32
  5. Quantize pre-activation z
  6. Dequantize for activation function
  7. Apply activation (in float32)
  8. Quantize output activations for next layer
This implementation uses quantize-dequantize (fake quantization) where computation happens in float32. True int8 inference would use integer matrix multiplication, which is significantly faster on specialized hardware.
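The error this fake-quantization path introduces at a single layer can be measured against a pure float32 matrix multiply; a minimal sketch with stand-in helpers (q8/dq8 mirror the quantize/dequantize methods above):

```python
import numpy as np

CLIP = 127

def q8(x):
    # Per-tensor symmetric quantization to int8 plus its scale.
    max_abs = np.max(np.abs(x))
    scale = max_abs / CLIP if max_abs else 1.0
    q = np.clip(np.round(x / scale), -CLIP, CLIP).astype(np.int8)
    return q, scale

def dq8(q, scale):
    return q.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(42)
a = rng.standard_normal((4, 8)).astype(np.float32)   # activations
w = rng.standard_normal((8, 3)).astype(np.float32)   # weights
b = np.zeros(3, dtype=np.float32)

# Reference float32 path
z_ref = a @ w + b

# Fake-quantized path: quantize-dequantize both operands,
# then compute the matmul in float32, as in the layer loop above.
z_q = dq8(*q8(a)) @ dq8(*q8(w)) + b

print(np.max(np.abs(z_q - z_ref)))  # small but nonzero quantization error
```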

Int8 Accuracy Impact

Typical accuracy degradation for Fashion-MNIST:
  • Float32 baseline: ~88-90%
  • Int8 quantized: ~85-88% (2-5% drop)
  • Memory: 25% of float32
The accuracy loss comes from:
  • Quantization error (rounding to nearest integer)
  • Clipping extreme values to [-127, 127]
  • Accumulated error through layer propagation
For better int8 accuracy, consider:
  • Post-training quantization with calibration data
  • Quantization-aware training (QAT)
  • Per-channel quantization instead of per-tensor
  • Learnable quantization parameters

Memory Comparison

For a 784-64-10 network (50,890 parameters):
Precision   Bytes per param   Total params size   Activations (B=32)   Total memory
float32     4                 198 KB              105 KB               303 KB
float16     2                 99 KB               52 KB                151 KB (↓50%)
int8        1                 49 KB               26 KB                75 KB (↓75%)
These reductions can be combined with smaller batch sizes under extreme memory constraints.
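The table's parameter count and sizes can be reproduced with a few lines of arithmetic (plain Python, no project code):

```python
# Parameter count and memory for a 784-64-10 dense network:
# each layer contributes in*out weights plus out biases.
layer_sizes = [784, 64, 10]

n_params = sum(i * o + o for i, o in zip(layer_sizes, layer_sizes[1:]))
print(n_params)  # 50890

for name, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: {n_params * nbytes / 1024:.1f} KB")
```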

Precision Mode Usage

At Model Creation

from config import PrecisionConfig
from student import NeuralNetwork

config = PrecisionConfig(
    train_dtype="float32",
    infer_precision="float16",  # Use float16 for inference
    seed=42
)

model = NeuralNetwork(
    layer_sizes=[784, 64, 10],
    activations=["relu", "softmax"],
    precision_config=config
)

Override at Inference Time

# Model configured for float16, but override to int8
predictions = model.predict(X_test, precision="int8")

# Explicitly use float32 for critical inference
accurate_preds = model.predict(X_critical, precision="float32")

In Benchmarking

from benchmark import benchmark_one_setup

result = benchmark_one_setup(
    layer_sizes=[784, 64, 10],
    activations=["relu", "softmax"],
    precision_mode="int8",  # Benchmark int8 performance
    batch_size=32,
    n_samples=512,
    epochs=2,
    seed=42
)

print(f"Int8 latency: {result['inference_latency_per_sample_s']:.6f}s")
print(f"Int8 memory: {result['peak_memory_mb']:.3f} MB")

Benchmark Comparison

Typical benchmark results from benchmark.py on a modern CPU:
# Example output
[
  {
    "precision_mode": "float32",
    "inference_latency_per_sample_s": 0.000142,
    "peak_memory_mb": 1.234,
    "final_train_accuracy": 0.891
  },
  {
    "precision_mode": "float16",
    "inference_latency_per_sample_s": 0.000138,  # ~3% faster
    "peak_memory_mb": 0.687,  # 50% memory
    "final_train_accuracy": 0.887  # -0.4% accuracy
  },
  {
    "precision_mode": "int8",
    "inference_latency_per_sample_s": 0.000151,  # Slower in simulation!
    "peak_memory_mb": 0.412,  # 75% reduction
    "final_train_accuracy": 0.863  # -2.8% accuracy
  }
]
Int8 can be slower in this software simulation due to quantization overhead. On real hardware with int8 acceleration (e.g., Intel VNNI, ARM dot-product), it would be significantly faster.
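A rough way to observe that overhead on your own machine is to time a plain float32 matmul against the same matmul with a fake-quantization round trip on both operands (a timing sketch; absolute numbers will vary by hardware):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 784)).astype(np.float32)
w = rng.standard_normal((784, 64)).astype(np.float32)

def fake_q(x):
    # Quantize-dequantize round trip, as in the int8 forward pass.
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

t0 = time.perf_counter()
for _ in range(50):
    a @ w
t_fp32 = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(50):
    fake_q(a) @ fake_q(w)
t_int8 = time.perf_counter() - t0

# The "int8" path typically loses here: it pays for quantization
# on top of the same float32 matmul.
print(f"float32: {t_fp32:.4f}s  fake-int8: {t_int8:.4f}s")
```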

Design Decisions

Why float32 for training?
  • Stability: Training requires high precision for gradient accumulation
  • Simplicity: Mixed-precision training adds complexity
  • Scope: This project focuses on inference-time precision trade-offs
  • Accuracy: Float32 training → low-precision inference is the standard deployment pattern

Why dynamic quantization?
  • No calibration required: Dynamic quantization computes scale factors on-the-fly
  • Simpler implementation: No need for calibration data or profiling
  • Trade-off: Per-tensor overhead makes it slower than static quantization
  • Production note: Static quantization is preferred for deployment

Why symmetric quantization?
  • Simpler math: Symmetric range [-127, 127] with zero mapped to 0
  • No zero-point offset: Reduces computation complexity
  • Trade-off: Wastes a small part of the range for asymmetric distributions
  • Alternative: Asymmetric quantization uses the full [-128, 127] range

What about other formats?
  • bfloat16: Brain floating point (16-bit with float32 range) is not implemented
  • int4/int2: Lower precision requires specialized kernel support
  • Mixed precision: Per-layer precision could be added as an extension
  • Scope: The current implementation covers the most common deployment scenarios

Limitations

These are software simulations that approximate low-precision behavior:
  • No hardware acceleration: Computation still runs through float32 NumPy kernels
  • Overhead: Quantization/dequantization adds latency in software
  • Conservative: Real optimized kernels can be much faster
  • Portability: Results may not match native low-precision implementations
  • Accuracy differences: Hardware quantization may have different rounding behavior

When to Use Each Precision

Float32

Use when:
  • Maximum accuracy required
  • Memory not constrained
  • Training the model
  • Debugging and development

Float16

Use when:
  • Moderate memory constraints
  • Minimal accuracy loss acceptable
  • Deploying to GPU with Tensor Cores
  • Good balance for most use cases

Int8

Use when:
  • Extreme memory constraints
  • Edge device deployment
  • CPU with int8 instructions (VNNI, etc.)
  • 2-5% accuracy loss acceptable

Next Steps

  • Hardware Constraints: Combine precision modes with memory constraints
  • Reproducibility: Ensure consistent precision behavior across runs
