Overview
The model supports three numeric precision modes for inference: float32 (full precision), float16 (half precision), and int8 (quantized). These modes trade numerical accuracy for reduced memory footprint and potentially faster computation.

Precision Configuration

Precision is controlled through the PrecisionConfig dataclass:
config.py:6-11
- Data type used during training (always float32 in this implementation)
- Precision mode for inference: float32, float16, or int8
- Maximum absolute value for int8 quantization (typically 127 for symmetric quantization)
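The fields above can be sketched as a dataclass. The field names below (`training_dtype`, `inference_dtype`, `int8_max`) are illustrative guesses, not necessarily the names used in config.py:

```python
from dataclasses import dataclass

@dataclass
class PrecisionConfig:
    # Data type used during training (always float32 in this implementation)
    training_dtype: str = "float32"
    # Precision mode for inference: "float32", "float16", or "int8"
    inference_dtype: str = "float32"
    # Maximum absolute value for symmetric int8 quantization
    int8_max: int = 127
```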
Float32: Full Precision
IEEE 754 single-precision floating point.

- Storage: 4 bytes per value
- Range: ±1.4e-45 to ±3.4e38
- Precision: ~7 decimal digits
- Use case: Training and high-accuracy inference
model.py:87-91
Float16: Half Precision
IEEE 754 half-precision floating point.

- Storage: 2 bytes per value
- Range: ±6e-8 to ±65,504
- Precision: ~3 decimal digits
- Use case: Memory-constrained inference
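These limits are easy to observe directly in NumPy; a quick sketch (not part of the repo's code):

```python
import numpy as np

# ~3 decimal digits: 1.0001 sits below float16's resolution near 1.0,
# so it rounds to exactly 1.0 when cast.
print(np.float16(1.0001))   # -> 1.0

# Values above 65,504 exceed float16's range and overflow to infinity.
print(np.float16(70000.0))  # -> inf
```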
Float16 Implementation
Float16 inference is implemented by casting weights and activations: model.py:82-84
This is a simulation that converts float32 to float16 in software. On real hardware with native float16 support (e.g., NVIDIA Tensor Cores), the speedup would be much more significant.
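A minimal sketch of this software simulation, assuming a dense layer with weights `w`, bias `b`, and input activations `a` (the function name and signature are illustrative, not the actual model.py code):

```python
import numpy as np

def forward_float16(w, b, a):
    """Simulate float16 inference: cast weights and activations to
    half precision, compute in float16, return float32 for bookkeeping."""
    w16 = w.astype(np.float16)
    b16 = b.astype(np.float16)
    a16 = a.astype(np.float16)
    z = a16 @ w16 + b16          # matmul carried out in float16
    return z.astype(np.float32)  # cast back up for the rest of the pipeline
```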
Float16 Accuracy Considerations
For the Fashion-MNIST task, float16 typically shows minimal accuracy degradation:
- Float32 test accuracy: ~88-90%
- Float16 test accuracy: ~88-89% (0-1% drop)
- Memory: 50% of float32
Int8: Quantized Precision
8-bit signed integer with dynamic quantization.

- Storage: 1 byte per value
- Range: -128 to 127 (before scaling)
- Precision: Integer values only
- Use case: Extreme memory constraints, edge deployment
Int8 Quantization Scheme
This implementation uses dynamic symmetric quantization: model.py:55-65
Quantization Algorithm
This is a per-tensor quantization scheme. More sophisticated approaches use per-channel quantization or learned quantization parameters for better accuracy.
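Dynamic symmetric per-tensor quantization can be sketched as follows. This is an illustrative implementation of the scheme described above, not a copy of model.py; the function names are assumptions:

```python
import numpy as np

def quantize(x, qmax=127):
    """Dynamic symmetric per-tensor quantization to int8.

    The scale is computed on-the-fly from this tensor's own max
    absolute value, so no calibration data is needed."""
    scale = np.max(np.abs(x)) / qmax
    if scale == 0:
        scale = 1.0  # all-zero tensor: any scale works, avoid division by zero
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale
```

Because the scheme is symmetric, zero always maps exactly to code 0, and the worst-case rounding error per element is half a quantization step (scale / 2).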
Int8 Forward Pass
Each layer operation involves quantization, computation, and dequantization: model.py:70-80

1. Quantize activations from previous layer
2. Quantize current layer weights
3. Dequantize both for matrix multiply
4. Compute z = a @ w + b in float32
5. Quantize pre-activation z
6. Dequantize for activation function
7. Apply activation (in float32)
8. Quantize output activations for next layer
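The eight steps above can be sketched as a single layer function. This is a hedged reconstruction of the pattern, not the actual model.py code; the helper `_quant` and the ReLU default are assumptions:

```python
import numpy as np

def _quant(x, qmax=127):
    """Dynamic symmetric per-tensor int8 quantization."""
    scale = max(np.max(np.abs(x)) / qmax, 1e-12)
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8), scale

def int8_layer(a, w, b, activation=lambda z: np.maximum(z, 0.0)):
    qa, sa = _quant(a)                   # 1. quantize incoming activations
    qw, sw = _quant(w)                   # 2. quantize layer weights
    a_dq = qa.astype(np.float32) * sa    # 3. dequantize both for the matmul
    w_dq = qw.astype(np.float32) * sw
    z = a_dq @ w_dq + b                  # 4. compute z = a @ w + b in float32
    qz, sz = _quant(z)                   # 5. quantize the pre-activation z
    z_dq = qz.astype(np.float32) * sz    # 6. dequantize for the activation
    out = activation(z_dq)               # 7. apply activation in float32
    qo, so = _quant(out)                 # 8. quantize outputs for the next layer
    return qo.astype(np.float32) * so
```

Each quantize/dequantize pair injects a small rounding error, which is why accuracy loss accumulates through the layers.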
Int8 Accuracy Impact
Typical accuracy degradation for Fashion-MNIST:
- Float32 baseline: ~88-90%
- Int8 quantized: ~85-88% (2-5% drop)
- Memory: 25% of float32
The accuracy loss stems from:
- Quantization error (rounding to nearest integer)
- Clipping extreme values to [-127, 127]
- Accumulated error through layer propagation
Memory Comparison
For a 784-64-10 network (50,890 parameters):

| Precision | Bytes per param | Total params size | Activations (B=32) | Total memory |
|---|---|---|---|---|
| float32 | 4 | 198 KB | 105 KB | 303 KB |
| float16 | 2 | 99 KB | 52 KB | 151 KB ↓50% |
| int8 | 1 | 49 KB | 26 KB | 75 KB ↓75% |
These precision savings stack with batch-size reductions when memory is extremely constrained.
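The parameter-size column is straightforward to verify. A quick sketch (the function name is illustrative):

```python
def params_memory_kb(n_params, bytes_per_param):
    """Parameter memory at a given precision, in KB."""
    return n_params * bytes_per_param / 1024

# 784-64-10 network: two weight matrices plus their bias vectors
n_params = 784 * 64 + 64 + 64 * 10 + 10   # 50,890 parameters

kb_f32 = params_memory_kb(n_params, 4)  # ~198.8 KB, the ~198 KB row above
kb_f16 = params_memory_kb(n_params, 2)  # ~99.4 KB
kb_i8  = params_memory_kb(n_params, 1)  # ~49.7 KB
```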
Precision Mode Usage
At Model Creation
Override at Inference Time
In Benchmarking
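The three usage patterns above can be sketched with a config object. This is a hedged illustration of the pattern, assuming a frozen PrecisionConfig (trimmed to one field for brevity); the repo's actual API may differ:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PrecisionConfig:
    inference_dtype: str = "float32"

# At model creation: choose the precision once.
config = PrecisionConfig(inference_dtype="float16")

# Override at inference time: derive a new config per call.
int8_config = replace(config, inference_dtype="int8")

# In benchmarking: sweep all three modes against the same model.
benchmark_configs = [replace(config, inference_dtype=m)
                     for m in ("float32", "float16", "int8")]
```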
Benchmark Comparison
Typical benchmark results from benchmark.py on a modern CPU:
Int8 can be slower in this software simulation due to quantization overhead. On real hardware with int8 acceleration (e.g., Intel VNNI, ARM dot-product), it would be significantly faster.
Design Decisions
Why train in float32 only?
- Stability: Training requires high precision for gradient accumulation
- Simplicity: Mixed-precision training adds complexity
- Scope: This project focuses on inference-time precision trade-offs
- Accuracy: Float32 training → low-precision inference is the standard deployment pattern
Why dynamic quantization instead of static?
- No calibration required: Dynamic quantization computes scale factors on-the-fly
- Simpler implementation: No need for calibration data or profiling
- Trade-off: Per-tensor overhead makes it slower than static quantization
- Production note: Static quantization is preferred for deployment
Why symmetric quantization?
- Simpler math: Symmetric range [-127, 127] with zero at 0
- No zero-point offset: Reduces computation complexity
- Trade-off: Wastes a bit of range for asymmetric distributions
- Alternative: Asymmetric quantization uses the full [-128, 127] range
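The trade-off is easy to see on an all-positive tensor. An illustrative sketch (not the repo's code) comparing the two schemes:

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 2.0], dtype=np.float32)  # all-positive tensor

# Symmetric: zero maps exactly to code 0, but for this tensor every
# negative code in [-127, -1] goes unused -- half the range is wasted.
s_sym = np.max(np.abs(x)) / 127
q_sym = np.round(x / s_sym).astype(np.int8)

# Asymmetric: a zero-point offset shifts the codes so that the full
# [-128, 127] range covers exactly [min, max].
s_asym = (x.max() - x.min()) / 255
zp = -128 - int(np.round(x.min() / s_asym))
q_asym = (np.round(x / s_asym) + zp).astype(np.int8)
```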
What about other precisions?
- bfloat16: Brain floating point (16-bit format with float32 exponent range); not implemented here
- int4/int2: Lower precision requires specialized kernel support
- Mixed precision: Per-layer precision could be added as an extension
- Scope: Current implementation covers the most common deployment scenarios
Limitations
When to Use Each Precision
Float32
Use when:
- Maximum accuracy required
- Memory not constrained
- Training the model
- Debugging and development
Float16
Use when:
- Moderate memory constraints
- Minimal accuracy loss acceptable
- Deploying to GPU with Tensor Cores
- Good balance for most use cases
Int8
Use when:
- Extreme memory constraints
- Edge device deployment
- CPU with int8 instructions (VNNI, etc.)
- 2-5% accuracy loss acceptable
Next Steps
- Hardware Constraints: Combine precision modes with memory constraints
- Reproducibility: Ensure consistent precision behavior across runs