## Overview
Zvec supports multiple quantization types that compress 32-bit floating-point vectors into lower-precision formats:

| Quantization Type | Bits per Dimension | Memory Reduction | Accuracy Impact |
|---|---|---|---|
| None (FP32) | 32 | Baseline (1x) | Baseline (100%) |
| FP16 | 16 | 50% | Minimal (<1% loss) |
| INT8 | 8 | 75% | Low (1-3% loss) |
| INT4 | 4 | 87.5% | Medium (3-5% loss) |
## QuantizeType Enum
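The exact enum members are not reproduced here; the following pure-Python sketch mirrors the four types in the table above, with member names and values that are assumptions rather than Zvec's verified identifiers:

```python
from enum import Enum

class QuantizeType(Enum):
    """Illustrative stand-in for Zvec's quantization enum (names assumed)."""
    NONE = "none"   # FP32, no compression
    FP16 = "fp16"   # half precision, 2x compression
    INT8 = "int8"   # 8-bit integer, 4x compression
    INT4 = "int4"   # 4-bit integer, 8x compression

# Bits per dimension for each type, matching the table above.
BITS = {QuantizeType.NONE: 32, QuantizeType.FP16: 16,
        QuantizeType.INT8: 8, QuantizeType.INT4: 4}
```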
## Configuration
Quantization is configured per index through the `quantize_type` parameter:
### HNSW with Quantization
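A sketch of what an HNSW index with INT8 quantization might look like; apart from `quantize_type`, which is documented above, the function and parameter names here are assumptions rather than Zvec's verified API:

```python
# Hypothetical sketch -- names other than quantize_type are assumed.
index = zvec.create_index(
    name="articles",
    index_type="HNSW",
    dim=768,
    quantize_type="int8",    # or a QuantizeType enum member
    m=16,                    # graph degree (assumed parameter)
    ef_construction=200,     # build-time beam width (assumed parameter)
)
```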
### IVF with Quantization
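An IVF index would be configured the same way; again, only `quantize_type` comes from this page and the remaining names are assumptions:

```python
# Hypothetical sketch -- names other than quantize_type are assumed.
index = zvec.create_index(
    name="articles",
    index_type="IVF",
    dim=768,
    quantize_type="int8",
    nlist=1024,              # number of coarse clusters (assumed parameter)
)
```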
### Flat with Quantization
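A flat (exhaustive) index needs no index-specific tuning beyond the quantization choice; as above, names other than `quantize_type` are assumptions:

```python
# Hypothetical sketch -- names other than quantize_type are assumed.
index = zvec.create_index(
    name="articles",
    index_type="FLAT",
    dim=768,
    quantize_type="fp16",
)
```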
## Memory vs Accuracy Trade-offs
### FP16 (Half Precision)
**Compression**: 2x

**Accuracy Loss**: <1%

**Best For**: Production systems requiring minimal accuracy loss

**Pros**:

- Minimal accuracy degradation
- Native hardware support on modern GPUs
- Good balance of speed and memory

**Cons**:

- Less compression than INT8/INT4
- Still requires significant memory for large datasets

**Use Cases**:

- Production deployments with strict accuracy requirements
- Datasets with 100K-10M vectors
- GPU-accelerated workloads
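The <1% figure follows from FP16's 11-bit significand, which bounds worst-case rounding error at about 0.05% per value. The loss is easy to observe with only the standard library:

```python
import struct

def to_fp16_and_back(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision ('e' format)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

value = 0.123456789
roundtrip = to_fp16_and_back(value)
rel_error = abs(roundtrip - value) / abs(value)
print(f"fp32 {value} -> fp16 {roundtrip} (relative error {rel_error:.2e})")
```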
### INT8 (8-bit Integer)
**Compression**: 4x

**Accuracy Loss**: 1-3%

**Best For**: Large-scale deployments where memory is a constraint

**Pros**:

- Significant memory reduction
- Fast SIMD operations on CPU
- Good accuracy for most applications

**Cons**:

- Requires calibration/training data for optimal quantization
- May impact recall on high-dimensional sparse data

**Use Cases**:

- Datasets with >1M vectors
- Memory-constrained environments
- Embeddings with redundant information (e.g., text embeddings)
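A common scalar-quantization scheme (one symmetric scale per vector, derived from its maximum absolute value) can be sketched in pure Python. This illustrates the general technique, not Zvec's internal implementation:

```python
def quantize_int8(vec):
    """Symmetric per-vector scalar quantization to int8."""
    scale = max(abs(v) for v in vec) / 127 or 1.0
    q = [round(v / scale) for v in vec]          # each code in [-127, 127]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

vec = [0.12, -0.98, 0.33, 0.05]
q, scale = quantize_int8(vec)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(vec, restored))
print(f"codes={q}, max reconstruction error={max_err:.4f}")
```

The reconstruction error is bounded by half a quantization step (`scale / 2`), which is why accuracy loss stays small when vector values share a similar range.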
### INT4 (4-bit Integer)
**Compression**: 8x

**Accuracy Loss**: 3-5%

**Best For**: Extreme memory optimization scenarios

**Pros**:

- Maximum compression
- Enables very large datasets in memory
- Useful for edge devices

**Cons**:

- Noticeable accuracy degradation
- Limited hardware acceleration
- Not suitable for all embedding types

**Use Cases**:

- Edge deployments with limited memory
- Datasets with >10M vectors
- Coarse-grained retrieval followed by re-ranking
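The 8x figure comes from packing two 4-bit codes into each byte. A minimal sketch of the packing step (illustrative, not Zvec's codec):

```python
def pack_int4(codes):
    """Pack 4-bit codes (each 0..15) two per byte."""
    if len(codes) % 2:
        codes = codes + [0]                       # pad odd-length input
    return bytes((codes[i] << 4) | codes[i + 1]
                 for i in range(0, len(codes), 2))

def unpack_int4(packed, n):
    """Recover the first n 4-bit codes from packed bytes."""
    out = []
    for b in packed:
        out.extend((b >> 4, b & 0x0F))
    return out[:n]

codes = [3, 15, 0, 7, 9]
packed = pack_int4(codes)
assert unpack_int4(packed, len(codes)) == codes
print(len(packed))   # 3 bytes, versus 20 bytes for the same 5 values in FP32
```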
## Performance Impact
### Search Latency
Quantization can improve search latency due to better cache utilization:

| Index Type | Quantization | QPS (1M vectors) | Latency (p99) |
|---|---|---|---|
| HNSW | None (FP32) | 25,000 | 2.5ms |
| HNSW | FP16 | 30,000 | 2.0ms |
| HNSW | INT8 | 35,000 | 1.8ms |
| HNSW | INT4 | 40,000 | 1.5ms |
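The throughput gains above come alongside large memory savings. As a worked example, raw vector storage for 1M vectors at 768 dimensions (a common embedding size, assumed here for illustration; index overhead such as HNSW graph links is excluded):

```python
n_vectors, dim = 1_000_000, 768

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    mb = n_vectors * dim * bits / 8 / 2**20   # bytes -> MiB
    print(f"{name}: {mb:,.0f} MiB")
# Raw storage shrinks from about 2,930 MiB (FP32) to about 366 MiB (INT4).
```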
### Recall Comparison
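Recall comparisons measure recall@k: the fraction of the exact (FP32) top-k neighbors that the quantized search also returns. A minimal sketch of the metric:

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k that appears in the approximate top-k."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k

exact = [4, 12, 7, 30, 2]    # ids from an FP32 exhaustive search
approx = [4, 7, 12, 9, 2]    # ids from a quantized index
print(recall_at_k(exact, approx, 5))   # 4 of 5 overlap -> 0.8
```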
## Quantization Strategy
### Decision Tree
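The guidance from the trade-off sections above can be condensed into a small helper. The thresholds simply restate the dataset-size recommendations given earlier; they are rules of thumb, not hard limits:

```python
def choose_quantization(n_vectors, memory_constrained=False, edge_device=False):
    """Rule-of-thumb mapping from the sections above to a quantize_type."""
    if edge_device or n_vectors > 10_000_000:
        return "int4"       # maximum compression; pair with re-ranking
    if memory_constrained or n_vectors > 1_000_000:
        return "int8"       # 4x savings, 1-3% accuracy loss
    return "fp16"           # default: 2x savings, <1% loss

print(choose_quantization(500_000))       # fp16
print(choose_quantization(5_000_000))     # int8
print(choose_quantization(50_000_000))    # int4
```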
## Complete Example
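The end-to-end pattern the sections above describe — store vectors quantized, search in the quantized domain, then refine top candidates at full precision — can be illustrated without any library dependencies. This is a didactic sketch, not Zvec code:

```python
import random
random.seed(0)

DIM, N = 32, 200

def quantize_int8(vec):
    """Symmetric per-vector int8 quantization (same scheme as above)."""
    scale = max(abs(v) for v in vec) / 127 or 1.0
    return [round(v / scale) for v in vec], scale

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# 1. Toy corpus, stored quantized; vector 17 is planted as a strong
#    match for the query so the expected result is unambiguous.
query = [random.gauss(0, 1) for _ in range(DIM)]
corpus = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
corpus[17] = [3.0 * x for x in query]
codes = [quantize_int8(v) for v in corpus]

# 2. Coarse search entirely in the quantized domain.
q_code, q_scale = quantize_int8(query)

def approx_score(i):
    q, scale = codes[i]
    return dot(q, q_code) * scale * q_scale

coarse = sorted(range(N), key=approx_score, reverse=True)

# 3. Refine the top 10 candidates with the full-precision vectors.
refined = sorted(coarse[:10], key=lambda i: dot(corpus[i], query), reverse=True)
print(refined[0])   # 17
```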
## Best Practices
### 1. Start with FP16

For most production use cases, FP16 provides the best balance of memory savings and accuracy.

### 2. Benchmark Your Data

Different embedding models respond differently to quantization, so measure recall on your own vectors before committing to a type.

### 3. Use Refiner for INT8/INT4

Enable query-time refinement to recover accuracy: search over the quantized vectors first, then re-score the top candidates at full precision.

### 4. Consider Hybrid Approaches

Use aggressive quantization for the first retrieval stage, followed by re-ranking with higher-precision vectors.

### 5. Monitor Recall

Track recall metrics in production: periodically compare a sample of quantized search results against exact FP32 results to catch regressions.

## Limitations
- Quantization is lossy: Some information is discarded
- Not reversible: Cannot recover original FP32 from quantized vectors
- Calibration dependent: INT8/INT4 quality depends on training data distribution
- Hardware support: INT4 may not be hardware-accelerated on all platforms
## See Also
- Index Types - Choose the right index
- Performance Tuning - Optimize query speed
- Reranking - Improve accuracy with two-stage retrieval