Why Quantization?
Full-precision vectors (float32) require significant memory:
- 768-dimensional vector = 768 × 4 bytes = 3 KB per vector
- 1 million vectors = ~3GB of RAM
- 100 million vectors = ~300GB of RAM
Qdrant uses quantized vectors for initial filtering, then rescores top candidates with original vectors for maximum accuracy.
Quantization Methods
| Method | Compression | Conversion | Best For |
|---|---|---|---|
| Scalar | 4x | Float32 → Int8 | General use |
| Product (PQ) | 8-64x | Vectors → Centroids | Maximum compression |
| Binary | 32x | Float32 → Bits | Fastest search |
Scalar Quantization
Converts float32 values to int8 by learning the value range.

Configuration
- type: Quantization type. Currently only int8 is supported.
- quantile: Quantile for range calculation (0.5-1.0). Use 0.99 to ignore outliers.
- always_ram: Keep quantized vectors in RAM even when main vectors are on disk.
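These settings map onto the Python client roughly as follows (a sketch, assuming the qdrant-client package and a server at localhost:6333; the collection name and vector size are illustrative placeholders):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",  # placeholder name
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,  # currently the only supported type
            quantile=0.99,                # ignore the top/bottom 1% of values
            always_ram=True,              # keep quantized vectors in RAM
        )
    ),
)
```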
How It Works
- Analyze value distribution across all vectors
- Calculate range based on quantile (e.g., 0.99 excludes top/bottom 1%)
- Map range to int8 values (-128 to 127)
- Store offset and scale for reconstruction
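The four steps can be sketched in plain Python (an illustrative simulation, not Qdrant's internal code):

```python
def scalar_quantize(vectors, quantile=0.99):
    """Map float values to int8 using a quantile-clipped range."""
    # 1. Collect the value distribution across all vectors
    values = sorted(v for vec in vectors for v in vec)
    # 2. Clip the range at the given quantile to ignore outliers
    lo = values[int((1.0 - quantile) * (len(values) - 1))]
    hi = values[int(quantile * (len(values) - 1))]
    # 3. Map [lo, hi] onto the int8 range [-128, 127]
    scale = (hi - lo) / 255.0 or 1.0  # avoid zero division for constant data
    offset = lo

    def encode(vec):
        out = []
        for x in vec:
            q = round((x - offset) / scale) - 128
            out.append(max(-128, min(127, q)))  # clamp outliers
        return out

    # 4. Keep offset and scale so values can be approximately reconstructed
    return encode, offset, scale

def dequantize(qvec, offset, scale):
    return [(q + 128) * scale + offset for q in qvec]
```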
Memory Savings
- Original: 768 dimensions × 4 bytes = 3,072 bytes
- Quantized: 768 dimensions × 1 byte = 768 bytes
- Savings: 75% reduction (4x compression)
When to Use
- Best for most use cases; good balance of compression and quality
- Works well with all distance metrics (Cosine, Euclidean, Dot)
- Minimal accuracy loss (typically less than 2% recall drop)
Product Quantization (PQ)
Divides vectors into chunks and represents each chunk by the nearest centroid.

Configuration
- compression: Compression ratio: x8, x16, x32, or x64.
- always_ram: Keep quantized vectors in RAM.
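With the Python client, enabling PQ on a new collection might look like this (a sketch; assumes qdrant-client and a server at localhost:6333, with a placeholder collection name):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",  # placeholder name
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    quantization_config=models.ProductQuantization(
        product=models.ProductQuantizationConfig(
            compression=models.CompressionRatio.X16,  # or X8 / X32 / X64
            always_ram=True,
        )
    ),
)
```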
How It Works
- Divide vector into M equal chunks (subvectors)
- Learn 256 centroids for each chunk using k-means
- Replace each chunk with nearest centroid ID (1 byte)
- Store codebook of centroids for reconstruction
Example with M = 16 chunks:
- Chunks: 768 / 48 = 16 chunks of 48 dimensions each
- Each chunk: 48 dimensions → 1 byte (centroid ID)
- Result: 768 dims → 16 bytes
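Once a codebook is trained, encoding reduces to a nearest-centroid lookup per chunk. A toy sketch with a handcrafted 2-chunk codebook (real PQ trains 256 centroids per chunk with k-means):

```python
def pq_encode(vec, codebooks):
    """Replace each chunk of `vec` with the ID of its nearest centroid."""
    m = len(codebooks)            # number of chunks
    chunk_len = len(vec) // m
    code = []
    for i in range(m):
        chunk = vec[i * chunk_len:(i + 1) * chunk_len]
        # nearest centroid by squared Euclidean distance
        best = min(
            range(len(codebooks[i])),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(chunk, codebooks[i][c])),
        )
        code.append(best)         # fits in 1 byte when there are <=256 centroids
    return code

def pq_decode(code, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    out = []
    for i, c in enumerate(code):
        out.extend(codebooks[i][c])
    return out

# 4-dim vector, 2 chunks, 2 centroids per chunk (real PQ uses 256)
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # centroids for chunk 0
    [[0.0, 1.0], [1.0, 0.0]],   # centroids for chunk 1
]
code = pq_encode([0.9, 1.1, 0.1, 0.9], codebooks)
# chunk 0 -> centroid 1, chunk 1 -> centroid 0
```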
Memory Savings
For a 768-dimensional vector (original: 768 × 4 = 3,072 bytes):
- x16: 3,072 / 16 = 192 bytes (93.75% reduction)
- x32: 3,072 / 32 = 96 bytes (96.9% reduction)
- x64: 3,072 / 64 = 48 bytes (98.4% reduction)
Training Requirements
PQ requires training on sample data:
- Samples needed: ~10,000 vectors (configurable)
- Training time: Proportional to dataset size and compression ratio
- K-means iterations: Up to 100 iterations per chunk
When to Use
- Maximum memory reduction needed (millions of vectors)
- Willing to accept a 5-10% recall drop
- Enough data for training (>10k vectors)
Binary Quantization
Represents each dimension as a bit (positive/negative or above/below threshold).

Configuration
- always_ram: Keep quantized vectors in RAM.
- encoding: Storage encoding: one_bit, two_bits, or one_and_half_bits.
- query_encoding: Query encoding for asymmetric quantization: same_as_storage, scalar_4bits, or scalar_8bits.

Encoding Methods
- One Bit: each dimension → 1 bit (sign). Compression: 32x for float32.
- Two Bits: each dimension → 2 bits (32 / 2 = 16x compression).
- One and Half Bits: each dimension → 1.5 bits (~21x compression).
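Sign-based one-bit encoding, and the Hamming-distance comparison it enables, can be sketched as (illustrative, not Qdrant's implementation):

```python
def binarize(vec):
    """One-bit encoding: bit i is 1 if dimension i is positive, packed into an int."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits - a fast proxy for vector distance."""
    return bin(a ^ b).count("1")

q = binarize([0.2, -0.7, 1.3, -0.1])   # dims 0 and 2 positive
d = binarize([0.4, -0.2, 0.9, -0.3])   # same sign pattern
```

Because comparisons become XOR plus a popcount, this is why binary quantization yields the fastest search.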
Asymmetric Quantization
Use higher precision for queries than stored vectors - for example, store vectors with one_bit encoding but encode queries with scalar_8bits. Queries are few, so the extra precision costs little memory while improving ranking accuracy.

Memory Savings
- Original: 768 × 4 bytes = 3,072 bytes
- One Bit: 768 bits = 96 bytes
- Savings: 96.9% reduction (32x compression)
When to Use
- Need the fastest possible search speed
- Working with high-dimensional vectors (>512 dims)
- Vectors have roughly balanced distributions
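Enabling binary quantization from the Python client might look like this (a sketch; assumes qdrant-client and a server at localhost:6333, with placeholder collection name and size; the encoding/query_encoding fields described above are available in newer client versions - check your client's models module for the exact spellings):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",  # placeholder name
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(
            always_ram=True,
            # newer clients also accept encoding / query_encoding here
        )
    ),
)
```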
Enabling Quantization
On Collection Creation
On Existing Collection
Updating quantization on an existing collection triggers background reindexing.
Python Client
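Applying quantization to an existing collection with the Python client might look like this (a sketch; assumes qdrant-client, a server at localhost:6333, and a placeholder collection name):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Enabling quantization on an existing collection triggers
# background reindexing of the stored vectors.
client.update_collection(
    collection_name="docs",  # placeholder name
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,
            always_ram=True,
        )
    ),
)
```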
Search with Quantization
Default Behavior
Quantized search uses a two-stage approach:
- Oversampling: Retrieve more candidates with quantized vectors (e.g., 3x limit)
- Rescoring: Re-rank top candidates with original vectors
- Return: Top results after rescoring
Controlling Oversampling
- rescore: Enable rescoring with original vectors.
- oversampling: Multiply limit by this factor for the initial quantized search.

Default oversampling:
- Scalar: 2.0
- PQ: 3.0
- Binary: 3.0
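Passed at query time, these options might look like this (a sketch; assumes qdrant-client and a server at localhost:6333, with placeholder collection name and query vector):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

results = client.query_points(
    collection_name="docs",   # placeholder name
    query=[0.1] * 768,        # placeholder query vector
    limit=10,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            rescore=True,       # re-rank candidates with original vectors
            oversampling=2.0,   # fetch limit * 2.0 candidates in stage one
        )
    ),
)
```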
Disabling Rescoring
For maximum speed at the cost of accuracy, set rescore to false.

Performance Tuning
Memory vs Disk Trade-off
- Maximum Speed: keep both original and quantized vectors in RAM
- Balanced: original vectors on disk, quantized vectors in RAM (always_ram: true)
- Maximum Compression: original vectors on disk with aggressive quantization (PQ or binary) in RAM
Choosing Compression Ratio
Start with Scalar Quantization
Test int8 scalar quantization first - it provides good compression with minimal accuracy loss.
Increase Compression if Needed
If you need more compression and can tolerate accuracy loss, try PQ with x16 or x32.
Best Practices
Test on Your Data
Quantization accuracy varies by dataset. Always test with your actual data and queries before production deployment.
Use always_ram for Speed
Set always_ram: true if you have sufficient RAM - quantized vectors are small and keeping them in memory significantly improves speed.
Monitor Recall Metrics
Track recall@k during search to ensure quantization doesn’t degrade results below acceptable thresholds.
Consider Vector Characteristics
- Normalized vectors: Binary quantization works well
- High variance: Scalar quantization with quantile tuning
- Very high dimensional: Product quantization for maximum compression
Limitations
- Quantization is only available for dense vectors (not sparse vectors)
- Cannot change quantization method without rebuilding the collection
- PQ requires minimum dataset size for training (~10k vectors)
- Binary quantization works best with normalized vectors
Memory Calculator
Estimate memory savings (the PQ column uses 48 bytes per 768-dim vector, i.e., a 64x size reduction versus float32):

| Dimensions | Vectors | Float32 | Scalar (int8) | PQ (x64) | Binary (1-bit) |
|---|---|---|---|---|---|
| 384 | 1M | 1.5 GB | 384 MB | 24 MB | 48 MB |
| 768 | 1M | 3.0 GB | 768 MB | 48 MB | 96 MB |
| 1536 | 1M | 6.0 GB | 1.5 GB | 96 MB | 192 MB |
| 768 | 10M | 30 GB | 7.5 GB | 480 MB | 960 MB |
| 768 | 100M | 300 GB | 75 GB | 4.8 GB | 9.6 GB |
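The table's figures can be reproduced with a small helper (decimal units, so 1 MB = 10^6 bytes, matching the table's rounding):

```python
def memory_bytes(dims, n_vectors, method="float32"):
    """Approximate storage for n_vectors of the given dimensionality."""
    per_vector = {
        "float32": dims * 4,       # 4 bytes per dimension
        "scalar": dims * 1,        # int8: 1 byte per dimension
        "pq_x64": dims * 4 // 64,  # 64x smaller than float32
        "binary": dims // 8,       # 1 bit per dimension
    }[method]
    return per_vector * n_vectors

# 1M vectors of 768 dims
assert memory_bytes(768, 1_000_000, "scalar") == 768_000_000  # 768 MB
assert memory_bytes(768, 1_000_000, "binary") == 96_000_000   # 96 MB
```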
Related Topics
- HNSW Index - Configure HNSW for optimal performance with quantization
- Performance Tuning - Additional strategies for reducing memory usage