Vector quantization compresses vector representations to reduce memory usage and improve search performance. Qdrant supports three quantization methods, each offering different tradeoffs between compression ratio, search quality, and speed.

Why Quantization?

Full-precision vectors (float32) require significant memory:
  • 768-dimensional vector = 768 × 4 bytes = 3KB per vector
  • 1 million vectors = ~3GB of RAM
  • 100 million vectors = ~300GB of RAM
Quantization can reduce this by 4x to 32x while maintaining search quality.
Qdrant uses quantized vectors for initial filtering, then rescores top candidates with original vectors for maximum accuracy.
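A quick way to sanity-check these numbers is a small back-of-the-envelope helper (plain Python, independent of Qdrant; it counts raw vector storage only and ignores index structures and payload overhead):

def vector_memory_gb(num_vectors, dims, bytes_per_dim=4.0):
    # raw vector storage in GB
    return num_vectors * dims * bytes_per_dim / 1e9

print(vector_memory_gb(1_000_000, 768))        # ~3.1 GB as float32
print(vector_memory_gb(100_000_000, 768))      # ~307 GB as float32
print(vector_memory_gb(1_000_000, 768, 1.0))   # ~0.77 GB with int8 scalar quantization
print(vector_memory_gb(1_000_000, 768, 1/8))   # ~0.10 GB with 1-bit binary quantization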

Quantization Methods

Scalar

4x compression · Float32 → Int8 · Best for general use

Product (PQ)

8-64x compression · Vectors → Centroids · Maximum compression

Binary

32x compression · Float32 → Bits · Fastest search

Scalar Quantization

Converts float32 values to int8 by learning the value range.

Configuration

PUT /collections/my_collection
{
  "vectors": {
    "size": 768,
    "distance": "Cosine"
  },
  "quantization_config": {
    "scalar": {
      "type": "int8",
      "quantile": 0.99,
      "always_ram": true
    }
  }
}
  • type (string, default: "int8"): Quantization type. Currently only int8 is supported.
  • quantile (float, default: 1.0): Quantile for range calculation (0.5-1.0). Use 0.99 to ignore outliers.
  • always_ram (boolean, default: false): Keep quantized vectors in RAM even when the main vectors are on disk.

How It Works

  1. Analyze value distribution across all vectors
  2. Calculate range based on quantile (e.g., 0.99 excludes top/bottom 1%)
  3. Map range to int8 values (-128 to 127)
  4. Store offset and scale for reconstruction
# Pseudocode
min_val, max_val = calculate_quantile_range(vectors, quantile=0.99)
alpha = (max_val - min_val) / 255                          # scale, stored for reconstruction
quantized = ((vector - min_val) / alpha).round()           # map values into [0, 255]
quantized = (quantized.clip(0, 255) - 128).astype(int8)    # shift into int8 range [-128, 127]
# reconstruction: original ≈ (quantized + 128) * alpha + min_val

Memory Savings

  • Original: 768 dimensions × 4 bytes = 3,072 bytes
  • Quantized: 768 dimensions × 1 byte = 768 bytes
  • Savings: 75% reduction (4x compression)

When to Use

  • Best for most use cases, with a good balance of compression and quality
  • Works well with all distance metrics (Cosine, Euclidean, Dot)
  • Minimal accuracy loss (typically less than 2% recall drop)

Product Quantization (PQ)

Divides vectors into chunks and represents each chunk by the nearest centroid.

Configuration

PUT /collections/my_collection
{
  "vectors": {
    "size": 768,
    "distance": "Cosine"
  },
  "quantization_config": {
    "product": {
      "compression": "x16",
      "always_ram": true
    }
  }
}
  • compression (string, required): Compression ratio, one of x8, x16, x32, or x64.
  • always_ram (boolean, default: false): Keep quantized vectors in RAM.
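If you prefer the Python client, the equivalent setup is sketched below (this assumes qdrant-client's ProductQuantization / ProductQuantizationConfig / CompressionRatio models; verify the names against your client version):

from qdrant_client import QdrantClient, models

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="my_collection",
    vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
    quantization_config=models.ProductQuantization(
        product=models.ProductQuantizationConfig(
            compression=models.CompressionRatio.X16,
            always_ram=True,
        )
    ),
)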

How It Works

  1. Divide vector into M equal chunks (subvectors)
  2. Learn 256 centroids for each chunk using k-means
  3. Replace each chunk with nearest centroid ID (1 byte)
  4. Store codebook of centroids for reconstruction
Example with a 768-dimensional vector and x16 compression (the quantized vector is 16x smaller than the float32 original, so each chunk covers 16 bytes = 4 dimensions):
  • Chunks: 768 / 4 = 192 chunks
  • Each chunk: 4 dimensions → 1 byte (centroid ID)
  • Result: 3,072 bytes → 192 bytes
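To make the mechanics concrete, here is a small NumPy sketch of PQ encoding and decoding against an already-trained codebook (illustrative only, not Qdrant's internal implementation; a random codebook stands in for the k-means centroids):

import numpy as np

dims, chunk_dims = 768, 4                  # x16 compression: 4 float32 dims -> 1 byte
num_chunks = dims // chunk_dims            # 192 chunks
codebooks = np.random.randn(num_chunks, 256, chunk_dims)   # stand-in for trained centroids

def pq_encode(vector):
    # split into chunks and replace each chunk with the ID of its nearest centroid
    chunks = vector.reshape(num_chunks, chunk_dims)
    dists = np.linalg.norm(codebooks - chunks[:, None, :], axis=-1)
    return dists.argmin(axis=1).astype(np.uint8)            # 192 bytes

def pq_decode(codes):
    # approximate reconstruction: look up each chunk's centroid
    return codebooks[np.arange(num_chunks), codes].reshape(dims)

vec = np.random.randn(dims).astype(np.float32)
codes = pq_encode(vec)       # compressed representation (192 bytes)
approx = pq_decode(codes)    # lossy reconstruction (768 floats)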

Memory Savings

  • Original: 768 × 4 = 3,072 bytes
  • Quantized (x16): 3,072 / 16 = 192 bytes (93.75% reduction)
  • Quantized (x64): 3,072 / 64 = 48 bytes (98.4% reduction)

Training Requirements

PQ requires training on sample data:
  • Samples needed: ~10,000 vectors (configurable)
  • Training time: Proportional to dataset size and compression ratio
  • K-means iterations: Up to 100 iterations per chunk
PQ training happens automatically when enough vectors are indexed. Initial searches may be less accurate until training completes.

When to Use

  • Maximum memory reduction needed (millions of vectors)
  • Willing to accept a 5-10% recall drop
  • Enough data for training (>10k vectors)

Binary Quantization

Represents each dimension as a bit (positive/negative or above/below threshold).

Configuration

PUT /collections/my_collection
{
  "vectors": {
    "size": 768,
    "distance": "Cosine"
  },
  "quantization_config": {
    "binary": {
      "always_ram": true,
      "encoding": "one_bit",
      "query_encoding": "same_as_storage"
    }
  }
}
  • always_ram (boolean, default: false): Keep quantized vectors in RAM.
  • encoding (string, default: "one_bit"): Storage encoding: one_bit, two_bits, or one_and_half_bits.
  • query_encoding (string, default: "same_as_storage"): Query encoding for asymmetric quantization: same_as_storage, scalar_4bits, or scalar_8bits.

Encoding Methods

With the default one_bit encoding, each dimension is reduced to a single bit based on its sign:
value >= 0 → 1
value < 0  → 0
Compression: 32x for float32
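To illustrate why this is so fast (a conceptual NumPy sketch, not Qdrant's implementation): vectors collapse to bit arrays, and comparing two vectors reduces to XOR plus a popcount.

import numpy as np

def to_bits(vector):
    # sign-based one-bit encoding: 768 float32 values -> 96 bytes
    return np.packbits(vector >= 0)

def hamming_similarity(a_bits, b_bits):
    # fewer differing bits -> more similar; XOR + popcount is extremely cheap on CPU
    return -int(np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum())

query = np.random.randn(768).astype(np.float32)
stored = np.random.randn(768).astype(np.float32)
score = hamming_similarity(to_bits(query), to_bits(stored))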

Asymmetric Quantization

Use higher precision for queries than stored vectors:
{
  "binary": {
    "encoding": "one_bit",
    "query_encoding": "scalar_8bits"
  }
}
This improves accuracy at the cost of slightly slower query processing.

Memory Savings

  • Original: 768 × 4 bytes = 3,072 bytes
  • One Bit: 768 bits = 96 bytes
  • Savings: 96.9% reduction (32x compression)

When to Use

  • Need the fastest possible search speed
  • Working with high-dimensional vectors (>512 dims)
  • Vectors have roughly balanced distributions

Enabling Quantization

On Collection Creation

PUT /collections/my_collection
{
  "vectors": {
    "size": 384,
    "distance": "Cosine"
  },
  "quantization_config": {
    "scalar": {
      "type": "int8",
      "quantile": 0.99,
      "always_ram": true
    }
  }
}

On Existing Collection

PATCH /collections/my_collection
{
  "quantization_config": {
    "scalar": {
      "type": "int8",
      "quantile": 0.99,
      "always_ram": true
    }
  }
}
Updating quantization on an existing collection triggers background reindexing.

Python Client

from qdrant_client import QdrantClient, models

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="my_collection",
    vectors_config=models.VectorParams(
        size=384,
        distance=models.Distance.COSINE
    ),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,
            always_ram=True
        )
    )
)
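To enable quantization on an existing collection from Python, pass the same config object to update_collection (a sketch mirroring the PATCH request above; it reuses the client from the previous example):

client.update_collection(
    collection_name="my_collection",
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,
            always_ram=True,
        )
    ),
)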

Search with Quantization

Default Behavior

Quantized search uses a two-stage approach:
  1. Oversampling: Retrieve more candidates with quantized vectors (e.g., 3x limit)
  2. Rescoring: Re-rank top candidates with original vectors
  3. Return: Top results after rescoring
This maintains high accuracy while benefiting from quantization speed.

Controlling Oversampling

POST /collections/my_collection/points/search
{
  "vector": [0.1, 0.2, 0.3, ...],
  "limit": 10,
  "params": {
    "quantization": {
      "rescore": true,
      "oversampling": 2.0
    }
  }
}
  • rescore (boolean, default: true): Enable rescoring with original vectors.
  • oversampling (float, default depends on method): Multiply limit by this factor for the initial quantized search. Defaults: Scalar 2.0, PQ 3.0, Binary 3.0.
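With the Python client, the same options are passed through SearchParams (a sketch reusing the client and 384-dimensional collection created earlier; the query vector is a toy value):

results = client.search(
    collection_name="my_collection",
    query_vector=[0.1, 0.2, 0.3] + [0.0] * 381,   # toy 384-dimensional query
    limit=10,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            rescore=True,
            oversampling=2.0,
        )
    ),
)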

Disabling Rescoring

For maximum speed at cost of accuracy:
{
  "params": {
    "quantization": {
      "rescore": false
    }
  }
}

Performance Tuning

Memory vs Disk Trade-off

{
  "quantization_config": {
    "scalar": {
      "always_ram": true
    }
  },
  "hnsw_config": {
    "on_disk": false
  }
}
Keep everything in RAM for fastest search.
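In the Python client this trade-off might be expressed as follows (a sketch; assumes the client from the earlier example and uses HnswConfigDiff for the HNSW settings):

client.update_collection(
    collection_name="my_collection",
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(type=models.ScalarType.INT8, always_ram=True)
    ),
    hnsw_config=models.HnswConfigDiff(on_disk=False),   # keep the HNSW graph in RAM as well
)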

Choosing Compression Ratio

  1. Start with scalar quantization: Test int8 scalar quantization first; it provides good compression with minimal accuracy loss.
  2. Measure accuracy: Use your test queries to measure recall@k before and after quantization (see the sketch after this list).
  3. Increase compression if needed: If you need more compression and can tolerate the accuracy loss, try PQ with x16 or x32.
  4. Tune oversampling: Increase the oversampling factor if accuracy drops too much.
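One simple way to measure recall@k is to compare default (quantized) search results against exact results on the same queries. A minimal sketch, reusing the client from the earlier examples; exact=True forces brute-force search and quantization ignore=True bypasses quantized vectors to produce the ground truth:

def recall_at_k(query_vector, k=10):
    exact = client.search(
        collection_name="my_collection",
        query_vector=query_vector,
        limit=k,
        search_params=models.SearchParams(
            exact=True,                                                  # brute-force ground truth
            quantization=models.QuantizationSearchParams(ignore=True),  # skip quantized vectors
        ),
    )
    quantized = client.search(
        collection_name="my_collection",
        query_vector=query_vector,
        limit=k,
    )
    exact_ids = {point.id for point in exact}
    return len(exact_ids & {point.id for point in quantized}) / k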

Best Practices

Quantization accuracy varies by dataset. Always test with your actual data and queries before production deployment.
Set always_ram: true if you have sufficient RAM - quantized vectors are small and keeping them in memory significantly improves speed.
Track recall@k during search to ensure quantization doesn’t degrade results below acceptable thresholds.
Match the method to your data:
  • Normalized vectors: binary quantization works well
  • High variance: scalar quantization with quantile tuning
  • Very high dimensional: product quantization for maximum compression

Limitations

  • Quantization is only available for dense vectors (not sparse vectors)
  • Cannot change quantization method without rebuilding the collection
  • PQ requires minimum dataset size for training (~10k vectors)
  • Binary quantization works best with normalized vectors

Memory Calculator

Estimate memory savings:
Dimensions | Vectors | Float32 | Scalar (int8) | PQ (x64) | Binary (1-bit)
384        | 1M      | 1.5 GB  | 384 MB        | 24 MB    | 48 MB
768        | 1M      | 3.0 GB  | 768 MB        | 48 MB    | 96 MB
1536       | 1M      | 6.0 GB  | 1.5 GB        | 96 MB    | 192 MB
768        | 10M     | 30 GB   | 7.5 GB        | 480 MB   | 960 MB
768        | 100M    | 300 GB  | 75 GB         | 4.8 GB   | 9.6 GB
  • HNSW Index - Configure HNSW for optimal performance with quantization
  • Performance Tuning - Additional strategies for reducing memory usage
