
Overview

QuantizeType defines the quantization methods available for compressing vector embeddings. Quantization reduces memory usage and can improve search speed at the cost of some accuracy.
import zvec

print(zvec.QuantizeType.INT8)
# Output: QuantizeType.INT8

Available Quantization Types

UNDEFINED
QuantizeType
No quantization. Vectors are stored in their original precision.
Memory: 100% (baseline)
Accuracy: 100%
When to use: Accuracy is critical and memory is not a constraint.
# No quantization specified
field = Field(
    name="embedding",
    dtype=DataType.VECTOR_FP32,
    dim=768
    # quantize_type not specified = UNDEFINED
)
FP16
QuantizeType
16-bit floating point quantization. Reduces precision from 32-bit to 16-bit floats.
Memory: ~50% of original (half precision)
Accuracy: ~99.5% (minimal loss for most use cases)
When to use: General-purpose compression with negligible quality loss.
field = Field(
    name="embedding",
    dtype=DataType.VECTOR_FP32,
    dim=768,
    quantize_type=QuantizeType.FP16
)
INT8
QuantizeType
8-bit integer quantization. Converts floating point values to 8-bit signed integers.
Memory: ~25% of original (75% reduction)
Accuracy: ~95-98% (noticeable but acceptable loss)
When to use: Memory reduction is important and a slight accuracy loss is acceptable.
field = Field(
    name="embedding",
    dtype=DataType.VECTOR_FP32,
    dim=1536,
    quantize_type=QuantizeType.INT8
)
INT4
QuantizeType
4-bit integer quantization. Converts floating point values to 4-bit integers.
Memory: ~12.5% of original (87.5% reduction)
Accuracy: ~90-95% (significant loss, use with caution)
When to use: Extreme memory constraints, large-scale deployments, or when a recall drop is acceptable.
field = Field(
    name="embedding",
    dtype=DataType.VECTOR_FP32,
    dim=2048,
    quantize_type=QuantizeType.INT4
)

Quantization Properties

All QuantizeType enum members have these properties:
name
str
The name of the quantization type as a string.
QuantizeType.INT8.name  # "INT8"
value
int
The internal integer value of the quantization type.
QuantizeType.INT8.value  # 2
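Assuming QuantizeType is a standard Python Enum (as the name and value properties above suggest), you can iterate its members to inspect every available type. A minimal sketch:

from zvec import QuantizeType

# List every available quantization type with its internal value.
for qtype in QuantizeType:
    print(f"{qtype.name:10} -> {qtype.value}")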

Usage Examples

Basic Quantization

from zvec import Collection, Field, DataType, QuantizeType

schema = [
    Field(name="id", dtype=DataType.STRING, is_primary=True),
    Field(name="title", dtype=DataType.STRING),
    Field(
        name="embedding",
        dtype=DataType.VECTOR_FP32,
        dim=768,
        quantize_type=QuantizeType.INT8  # 8-bit quantization
    )
]

collection = Collection.create(
    name="compressed_docs",
    schema=schema
)

Comparing Quantization Levels

from zvec import QuantizeType

# Memory usage comparison for 1536-dimensional FP32 vector
base_size = 1536 * 4  # 6,144 bytes

quantization_levels = [
    (QuantizeType.UNDEFINED, base_size, "100%"),
    (QuantizeType.FP16, base_size // 2, "50%"),
    (QuantizeType.INT8, base_size // 4, "25%"),
    (QuantizeType.INT4, base_size // 8, "12.5%"),
]

for qtype, size, percent in quantization_levels:
    print(f"{qtype.name:12} | {size:5} bytes | {percent:6} memory")

# Output:
# UNDEFINED    |  6144 bytes |   100% memory
# FP16         |  3072 bytes |    50% memory
# INT8         |  1536 bytes |    25% memory
# INT4         |   768 bytes |  12.5% memory

Multi-Field Schema with Different Quantization

from zvec import Collection, Field, DataType, QuantizeType, MetricType

schema = [
    Field(name="id", dtype=DataType.STRING, is_primary=True),
    
    # High precision for critical field
    Field(
        name="primary_embedding",
        dtype=DataType.VECTOR_FP32,
        dim=768,
        metric=MetricType.COSINE,
        quantize_type=QuantizeType.FP16  # Minimal loss
    ),
    
    # Aggressive compression for secondary field
    Field(
        name="secondary_embedding",
        dtype=DataType.VECTOR_FP32,
        dim=1536,
        metric=MetricType.COSINE,
        quantize_type=QuantizeType.INT8  # More compression
    ),
    
    # Extreme compression for auxiliary field
    Field(
        name="auxiliary_embedding",
        dtype=DataType.VECTOR_FP32,
        dim=2048,
        metric=MetricType.L2,
        quantize_type=QuantizeType.INT4  # Maximum compression
    )
]

collection = Collection.create(
    name="multi_embedding",
    schema=schema
)

Choosing the Right Quantization

Decision Matrix

FP16

Best balance for most use cases.
✅ 50% memory savings
✅ ~99.5% accuracy retained
✅ Minimal quality loss
✅ Good for production
Use for: Text embeddings, semantic search, general applications.

INT8

Good compression with acceptable quality.
✅ 75% memory savings
⚠️ ~95-98% accuracy
⚠️ Noticeable but acceptable loss
✅ Faster search
Use for: Large-scale systems, cost-sensitive deployments, when some quality drop is acceptable.

INT4

Extreme compression for specific needs.
✅ 87.5% memory savings
❌ ~90-95% accuracy
❌ Significant quality loss
✅ Very fast search
Use for: Massive scale (billions of vectors), memory-critical environments, when a recall drop is acceptable.

UNDEFINED (No Quantization)

Maximum quality; the baseline.
✅ 100% accuracy
❌ 100% memory usage
❌ Slower search
Use for: Critical accuracy requirements, small datasets, benchmarking.

Quantization Trade-offs

Quantization        Memory per 768-dim vector   Savings
UNDEFINED (FP32)    3,072 bytes                 0%
FP16                1,536 bytes                 50%
INT8                768 bytes                   75%
INT4                384 bytes                   87.5%
For 1 million vectors (768-dim):
  • FP32: ~3 GB
  • FP16: ~1.5 GB
  • INT8: ~768 MB
  • INT4: ~384 MB
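
These figures follow directly from the bits stored per dimension. A small helper makes the arithmetic explicit (illustrative only; index-structure overhead is not included):

from zvec import QuantizeType

# Bits used to store one vector dimension under each quantization type.
BITS_PER_DIM = {
    QuantizeType.UNDEFINED: 32,  # FP32 baseline
    QuantizeType.FP16: 16,
    QuantizeType.INT8: 8,
    QuantizeType.INT4: 4,
}

def vector_storage_bytes(dim: int, qtype: QuantizeType) -> int:
    """Approximate raw storage for one vector, excluding index overhead."""
    return dim * BITS_PER_DIM[qtype] // 8

# 1 million vectors at 768 dimensions, matching the table above:
for qtype in BITS_PER_DIM:
    total_gb = vector_storage_bytes(768, qtype) * 1_000_000 / 1e9
    print(f"{qtype.name:10}: {total_gb:.3f} GB")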

Best Practices

1. Start with FP16

Begin with FP16 quantization for most applications. It provides excellent memory savings with minimal quality loss.
quantize_type=QuantizeType.FP16  # Safe default

2. Benchmark with Your Data

Test different quantization levels with your specific dataset and queries:
# Create test collections with different quantization levels.
# create_test_collection and evaluate_recall are user-supplied helpers
# (a sketch of evaluate_recall follows below).
for qtype in [QuantizeType.UNDEFINED, QuantizeType.FP16, 
              QuantizeType.INT8, QuantizeType.INT4]:
    test_collection = create_test_collection(qtype)
    recall = evaluate_recall(test_collection, test_queries)
    print(f"{qtype.name}: {recall:.2%} recall")

3. Consider Your Scale

Choose quantization based on dataset size (a helper encoding these thresholds follows the list):
  • < 1M vectors: FP16 or UNDEFINED
  • 1-10M vectors: FP16 or INT8
  • 10-100M vectors: INT8
  • > 100M vectors: INT8 or INT4
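
A rough heuristic encoding the thresholds above (not a zvec API, just a starting point to adapt):

from zvec import QuantizeType

def suggest_quantization(num_vectors: int) -> QuantizeType:
    """Map dataset size to a reasonable starting quantization level."""
    if num_vectors < 1_000_000:
        return QuantizeType.FP16   # or UNDEFINED if accuracy is critical
    if num_vectors < 10_000_000:
        return QuantizeType.FP16   # consider INT8 if memory-bound
    if num_vectors < 100_000_000:
        return QuantizeType.INT8
    return QuantizeType.INT8       # or INT4 under extreme memory pressure

print(suggest_quantization(5_000_000).name)  # FP16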

4. Monitor Quality Metrics

Track recall, precision, and user satisfaction after deploying quantization:
# Illustrative search-quality metrics from your own evaluation pipeline;
# capture them before and after enabling quantization.
metrics = {
    "recall_at_10": 0.95,  # fraction of true top-10 neighbors retrieved
    "ndcg_at_10": 0.92,    # ranking quality of the top 10
    "user_clicks": 0.85    # downstream engagement signal
}
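
One lightweight way to act on these numbers is to compare them against a pre-quantization baseline and flag regressions beyond a tolerance. A sketch with hypothetical baseline values:

metrics = {"recall_at_10": 0.95, "ndcg_at_10": 0.92, "user_clicks": 0.85}
baseline = {"recall_at_10": 0.99, "ndcg_at_10": 0.96, "user_clicks": 0.88}  # hypothetical pre-quantization values
TOLERANCE = 0.03  # flag drops larger than 3 points

for metric, current in metrics.items():
    drop = baseline[metric] - current
    if drop > TOLERANCE:
        print(f"WARNING: {metric} dropped {drop:.2f}; consider lighter quantization")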

Quantization and Vector Types

Compatibility: Quantization is typically applied to VECTOR_FP32 fields. Using VECTOR_FP16 already provides 16-bit storage, so additional quantization may not be beneficial.
# ❌ Redundant: FP16 vector with FP16 quantization
Field(
    dtype=DataType.VECTOR_FP16,
    quantize_type=QuantizeType.FP16
)

# ✅ Correct: FP32 vector with FP16 quantization
Field(
    dtype=DataType.VECTOR_FP32,
    quantize_type=QuantizeType.FP16
)
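
If field definitions are assembled dynamically, a quick guard can catch the redundant combination before collection creation (a sketch over the enum values shown above, not a built-in zvec check):

from zvec import DataType, QuantizeType

def is_redundant_quantization(dtype, quantize_type) -> bool:
    """FP16 storage plus FP16 quantization buys nothing extra."""
    return dtype == DataType.VECTOR_FP16 and quantize_type == QuantizeType.FP16

assert is_redundant_quantization(DataType.VECTOR_FP16, QuantizeType.FP16)
assert not is_redundant_quantization(DataType.VECTOR_FP32, QuantizeType.FP16)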

Advanced Considerations

Re-ranking with Quantized Vectors

For critical applications, use quantized vectors for initial retrieval, then re-rank with full precision:
# Retrieve more candidates than needed using the quantized index.
results = collection.query(
    vectors={"quantized_vec": query_embedding},
    topn=100  # over-retrieve to give the re-ranker room
)

# Re-rank the candidates with full precision or an external model;
# rerank is a user-supplied helper (one possible shape is sketched below).
final_results = rerank(results, topn=10)
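
One possible shape of rerank, with extra arguments for the query and a full-precision vector store. It assumes you can recover each candidate's original embedding (e.g. from a dict keyed by id) and that results expose an id; cosine similarity stands in for your metric of choice:

import numpy as np

def rerank(results, query_embedding, full_precision_store, topn=10):
    """Re-score candidates with full-precision vectors; return the top hits.

    full_precision_store: dict mapping id -> np.ndarray (hypothetical).
    """
    q = np.asarray(query_embedding, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scored = []
    for r in results:
        v = full_precision_store[r.id]  # assumes each result exposes an id
        score = float(np.dot(q, v / np.linalg.norm(v)))  # cosine similarity
        scored.append((score, r))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:topn]]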

Calibration

Some quantization methods benefit from calibration on representative data:
# INT8 quantizers typically derive their scale parameters from the data
# distribution at index-build time, so no explicit calibration input is
# passed here; sample_representative_vectors is a placeholder for however
# you gather representative data.
calibration_vectors = sample_representative_vectors(n=10000)

field = Field(
    name="embedding",
    dtype=DataType.VECTOR_FP32,
    dim=768,
    quantize_type=QuantizeType.INT8  # parameters are fitted internally
)
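
To build intuition for what calibration does, here is the arithmetic a symmetric INT8 quantizer typically performs: pick a scale from the observed value range, then round each component into [-127, 127]. This is a generic illustration, not zvec's internal algorithm:

import numpy as np

def int8_quantize(vectors: np.ndarray):
    """Symmetric per-tensor INT8 quantization (generic illustration)."""
    scale = np.abs(vectors).max() / 127.0  # derived from calibration data
    q = np.clip(np.round(vectors / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

calib = np.random.randn(10_000, 768).astype(np.float32)
q, scale = int8_quantize(calib)
error = np.abs(int8_dequantize(q, scale) - calib).mean()
print(f"scale={scale:.5f}, mean abs reconstruction error={error:.5f}")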
