
Overview

Vertex AI Vector Search (formerly Matching Engine) is a fully managed service that enables fast and scalable similarity search across millions or billions of embeddings. Built on Google’s ScaNN (Scalable Nearest Neighbors) algorithm, it powers search and recommendation features in Google products like YouTube and Google Play.
Vector Search can find nearest neighbors in milliseconds, even with billions of vectors, thanks to the ScaNN algorithm’s advanced quantization techniques.

Key Features

Blazing Fast

Millisecond-level queries across billions of vectors using the ScaNN algorithm

Fully Managed

No infrastructure management required; Google handles scaling and operations

Autoscaling

Automatically resizes replicas based on workload demand

Real-time Updates

Stream updates to add or remove vectors without reindexing

Getting Started

Installation

pip install --upgrade google-cloud-aiplatform

Enable APIs

gcloud services enable compute.googleapis.com \
    aiplatform.googleapis.com \
    --project YOUR_PROJECT_ID

Setup

from google.cloud import aiplatform
from datetime import datetime

# Configuration
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
UID = datetime.now().strftime("%m%d%H%M")

# Initialize
aiplatform.init(project=PROJECT_ID, location=LOCATION)

Create an Index

1. Prepare Embeddings

Create a JSONL file with your embeddings, one record per line. Every embedding must have the same number of dimensions as the index (768 in this guide):
{"id": "1", "embedding": [0.1, 0.2, ...]}
{"id": "2", "embedding": [0.3, 0.4, ...]}
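A file in this format can be generated with a few lines of Python. This is a minimal sketch; the filename and the three-element vectors are illustrative (a real index would use full-length embeddings):

```python
import json

# Illustrative records; in practice these are model-generated embeddings.
records = [
    {"id": "1", "embedding": [0.1, 0.2, 0.3]},
    {"id": "2", "embedding": [0.3, 0.4, 0.5]},
]

# Write one JSON object per line (JSONL), as Vector Search expects.
with open("embeddings.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```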
2. Upload to Cloud Storage

gsutil mb -l us-central1 gs://your-bucket-name
gsutil cp embeddings.json gs://your-bucket-name/
3. Create the Index

my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=f"my-index-{UID}",
    contents_delta_uri="gs://your-bucket-name/",
    dimensions=768,
    approximate_neighbors_count=10,
    distance_measure_type="DOT_PRODUCT_DISTANCE"
)

Index Parameters

my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="product-search-index",
    contents_delta_uri="gs://my-bucket/embeddings/",
    dimensions=768,
    approximate_neighbors_count=10,
    distance_measure_type="DOT_PRODUCT_DISTANCE"
)
| Parameter | Description | Default |
|---|---|---|
| `dimensions` | Size of each embedding vector | Required |
| `approximate_neighbors_count` | Default number of neighbors to find via approximate search before exact reordering | 10 |
| `distance_measure_type` | Distance metric: `DOT_PRODUCT_DISTANCE`, `COSINE_DISTANCE`, or `SQUARED_L2_DISTANCE` | `DOT_PRODUCT_DISTANCE` |
| `index_update_method` | `BATCH_UPDATE` or `STREAM_UPDATE` | `BATCH_UPDATE` |
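With DOT_PRODUCT_DISTANCE, a common practice is to L2-normalize embeddings before indexing so the dot product equals cosine similarity. A quick NumPy sketch (the two-dimensional vectors are illustrative):

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot product == cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

embeddings = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = l2_normalize(embeddings)

# Every row now has unit norm.
print(np.linalg.norm(unit, axis=1))  # → [1. 1.]
```

After normalization, ranking by dot product is equivalent to ranking by cosine similarity, which usually makes `DOT_PRODUCT_DISTANCE` the right choice.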

Deploy an Index Endpoint

To query your index, deploy it to an endpoint:
# Create endpoint
my_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=f"my-endpoint-{UID}",
    public_endpoint_enabled=True  # Enable public access
)

# Deploy index to endpoint (takes ~30 minutes first time)
DEPLOYED_INDEX_ID = f"deployed_index_{UID}"
my_endpoint.deploy_index(
    index=my_index,
    deployed_index_id=DEPLOYED_INDEX_ID,
    min_replica_count=1,
    max_replica_count=2
)
The first deployment to a new endpoint takes about 30 minutes to provision infrastructure. Subsequent deployments are much faster.

Query the Index

Basic Query

from google import genai
import numpy as np

# Generate query embedding
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)
query_text = "How to reset my password?"

query_embedding = client.models.embed_content(
    model="text-embedding-005",
    contents=[query_text]
).embeddings[0].values

# Search
response = my_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=[query_embedding],
    num_neighbors=10
)

# Process results
for idx, neighbor in enumerate(response[0]):
    print(f"Rank {idx + 1}:")
    print(f"  ID: {neighbor.id}")
    print(f"  Distance: {neighbor.distance}")

Batch Queries

# Multiple queries at once
queries = [
    "How to reset password?",
    "Where is my order?",
    "How to cancel subscription?"
]

# Generate embeddings
query_embeddings = [
    client.models.embed_content(
        model="text-embedding-005",
        contents=[q]
    ).embeddings[0].values
    for q in queries
]

# Batch search
responses = my_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=query_embeddings,
    num_neighbors=5
)

# Process each query's results
for query_idx, query_response in enumerate(responses):
    print(f"\nResults for: {queries[query_idx]}")
    for neighbor in query_response:
        print(f"  - {neighbor.id} (distance: {neighbor.distance:.4f})")

With Filtering

from google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint import (
    Namespace,
    NumericNamespace
)

# Filter by namespace
response = my_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=[query_embedding],
    num_neighbors=10,
    filter=[
        Namespace(name="category", allow_tokens=["electronics", "computers"]),
        NumericNamespace(
            name="price",
            value_int=100,
            op=NumericNamespace.Operator.LESS
        )
    ]
)
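Filtering only matches datapoints that were indexed with restricts. In the input JSONL, categorical restricts go under a "restricts" key and numeric ones under "numeric_restricts"; the sketch below builds one such record (field names follow the Vector Search input format; the values are illustrative):

```python
import json

# One input record carrying a categorical and a numeric restrict.
record = {
    "id": "42",
    "embedding": [0.1, 0.2, 0.3],
    "restricts": [
        {"namespace": "category", "allow": ["electronics"]}
    ],
    "numeric_restricts": [
        {"namespace": "price", "value_int": 79}
    ],
}

# Serialize as one JSONL line, ready to append to the embeddings file.
line = json.dumps(record)
print(line)
```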

Update Index

Stream Updates (Real-time)

Stream updates are only available for indexes created with index_update_method="STREAM_UPDATE" (passed to create_tree_ah_index at creation time).
from google.cloud.aiplatform.matching_engine.matching_engine_index_endpoint import (
    IndexDatapoint
)

# Add new embeddings
new_datapoints = [
    IndexDatapoint(
        datapoint_id="new_1",
        feature_vector=[0.1, 0.2, ...],  # 768 dimensions
    ),
    IndexDatapoint(
        datapoint_id="new_2",
        feature_vector=[0.3, 0.4, ...],
    )
]

my_endpoint.upsert_datapoints(
    deployed_index_id=DEPLOYED_INDEX_ID,
    datapoints=new_datapoints
)

# Remove embeddings
my_endpoint.remove_datapoints(
    deployed_index_id=DEPLOYED_INDEX_ID,
    datapoint_ids=["old_1", "old_2"]
)

Batch Updates

# Upload new embeddings file to a Cloud Storage directory
! gsutil cp updated_embeddings.json gs://your-bucket/updates/

# Update index (contents_delta_uri must point to a directory, not a file)
my_index = my_index.update_embeddings(
    contents_delta_uri="gs://your-bucket/updates/"
)

Index Types

Vector Search supports two index types: Tree-AH, an approximate nearest neighbor index built on ScaNN and recommended for most workloads, and brute force, which performs exact search and is intended for small datasets or for generating ground truth when measuring recall. This guide uses create_tree_ah_index; the SDK also provides create_brute_force_index.
Autoscaling

Configure automatic scaling based on demand:
my_endpoint.deploy_index(
    index=my_index,
    deployed_index_id=DEPLOYED_INDEX_ID,
    min_replica_count=1,   # scale down to one replica when idle
    max_replica_count=10,  # scale up to ten replicas under load
    enable_access_logging=True
)

Monitoring

Check Index Status

# Get index information
print(f"Index name: {my_index.display_name}")
print(f"Index stats: {my_index.index_stats}")
print(f"Deployed: {my_index.deployed_indexes}")

Query Metrics

# Enable access logging
my_endpoint.deploy_index(
    index=my_index,
    deployed_index_id=DEPLOYED_INDEX_ID,
    enable_access_logging=True
)

# View metrics in Cloud Console:
# https://console.cloud.google.com/vertex-ai/matching-engine/index-endpoints

Complete Example

Here’s a full end-to-end example: create an index, deploy it, and query it:
from google.cloud import aiplatform
from google import genai
from datetime import datetime

# Initialize
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"
UID = datetime.now().strftime("%m%d%H%M")

aiplatform.init(project=PROJECT_ID, location=LOCATION)
client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

# Create index
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name=f"docs-index-{UID}",
    contents_delta_uri="gs://your-bucket/embeddings/",
    dimensions=768,
    approximate_neighbors_count=10,
    distance_measure_type="DOT_PRODUCT_DISTANCE"
)
print(f"Index created: {my_index.resource_name}")

# Create endpoint and deploy the index
my_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=f"docs-endpoint-{UID}",
    public_endpoint_enabled=True
)
DEPLOYED_INDEX_ID = f"deployed_index_{UID}"
my_endpoint.deploy_index(
    index=my_index,
    deployed_index_id=DEPLOYED_INDEX_ID,
    min_replica_count=1,
    max_replica_count=2
)

# Embed a query and search
query_embedding = client.models.embed_content(
    model="text-embedding-005",
    contents=["How to reset my password?"]
).embeddings[0].values

response = my_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=[query_embedding],
    num_neighbors=5
)
for neighbor in response[0]:
    print(f"{neighbor.id}: {neighbor.distance:.4f}")

Cleanup

Remember to delete resources to avoid ongoing charges:
# Undeploy all indexes from the endpoint
my_endpoint.undeploy_all()

# Delete endpoint
my_endpoint.delete(force=True)

# Delete index
my_index.delete()

# Delete Cloud Storage bucket
! gsutil rm -r gs://your-bucket-name

Best Practices

1. Choose the Right Index Type: use Tree-AH for most cases; use brute force only for small datasets that require exact matches.
2. Tune Parameters: adjust approximate_neighbors_count based on your recall requirements (higher = better recall, slower queries).
3. Use Stream Updates: enable stream updates for real-time applications that frequently add or remove vectors.
4. Monitor Performance: enable access logging and monitor query latency in the Cloud Console.
5. Optimize Costs: use autoscaling to match capacity with demand and minimize idle resources.
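To tune approximate_neighbors_count against your recall requirements, compare the approximate results with exact ground truth (e.g. from a brute-force index over the same data). A minimal, framework-free sketch with illustrative neighbor IDs:

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the top-k exact neighbors found by the approximate search."""
    exact_top = set(exact_ids[:k])
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / k

# Illustrative neighbor ID lists for one query.
exact = ["a", "b", "c", "d", "e"]    # ground truth (e.g. brute-force index)
approx = ["a", "c", "b", "x", "y"]   # approximate (Tree-AH) results

print(recall_at_k(exact, approx, 5))  # → 0.6 (found a, b, c out of 5)
```

In practice you would average this over a sample of queries and raise approximate_neighbors_count until recall meets your target.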
