Index Operations - Materialize

Overview

Indexes in Materialize represent query results stored in memory within a cluster. Unlike traditional databases, Materialize indexes maintain complete, up-to-date results—not just keys and pointers. This guide covers when to create indexes, how to optimize them, and operational best practices.

Understanding Indexes

What Indexes Store

Indexes maintain:

Complete, up-to-date query results in memory
Incrementally updated data as new information arrives
Results scoped to a specific cluster

Indexes vs. Materialized Views

Feature	Index	Materialized View
Storage	In-memory (cluster-local)	Durable storage (system-wide)
Access	Same cluster only	Any cluster
Use case	Fast queries within cluster	Share results across clusters
Query cost	Computationally free	Computationally free
Query speed	Fastest (from memory)	Fast (from storage)

Rule of thumb: Use indexes for cluster-local query acceleration. Use materialized views when results need to be shared across multiple clusters.

When to Create Indexes

On Sources

Rarely needed. Consider indexing a source when:

Serving results directly from the source
Using the source in joins without transformation

CREATE INDEX idx_kafka_source ON my_kafka_source (key_column);

In practice, you’ll typically create views that transform sources and index those views instead.

On Views

Create indexes on views to maintain up-to-date results in memory:

CREATE VIEW active_orders AS
  SELECT * FROM orders
  WHERE status = 'active';

-- Index the view for fast queries
CREATE INDEX idx_active_orders ON active_orders (customer_id);

During index creation:

View is executed and results stored in memory
As new data arrives, results update incrementally
Queries are served instantly from memory with no computation

On Materialized Views

Indexing a materialized view loads durable results into memory:

CREATE MATERIALIZED VIEW order_totals AS
  SELECT customer_id, SUM(amount) AS total
  FROM orders
  GROUP BY customer_id;

-- Index for faster in-memory access
CREATE INDEX idx_order_totals ON order_totals (customer_id);

Benefits:

Materialized views are queryable from any cluster
Indexes provide faster access within the cluster where created
No additional computation required (results already maintained)

Creating Indexes

Basic Index Creation

Create an index on specific columns:

CREATE INDEX idx_orders_customer 
ON orders (customer_id);

Multi-Column Indexes

Index multiple columns for complex queries:

CREATE INDEX idx_orders_customer_status 
ON orders (customer_id, status);

Expression Indexes

Index computed expressions:

CREATE INDEX idx_orders_total 
ON orders (quantity * price);

CREATE INDEX idx_customers_upper_name 
ON customers (upper(name));

Default Indexes

Create an index using inferred unique keys:

CREATE DEFAULT INDEX ON my_view;

Materialize uses the view’s unique key (if available) as the index key. If no unique key exists, all columns are used.

Cluster-Specific Indexes

Explicitly specify the target cluster:

CREATE INDEX idx_orders 
IN CLUSTER serving 
ON orders (customer_id);

Remember: Indexes are cluster-local. Queries from other clusters cannot use them.

Index Usage Patterns

Point Lookups

Materialize performs point lookups (reads only matching records) when the query’s WHERE clause:

Specifies equality (= or IN) on all indexed fields
Uses only AND to combine conditions
Matches the exact indexed expression (including type)

-- Create index
CREATE INDEX idx_orders_qty ON orders (quantity);

-- Point lookup (fast)
SELECT * FROM orders WHERE quantity = 10;

-- Point lookup with IN (fast)
SELECT * FROM orders WHERE quantity IN (5, 10, 15);

-- Full scan (slower)
SELECT * FROM orders WHERE quantity > 10;

Multi-Column Point Lookups

-- Create multi-column index
CREATE INDEX idx_orders_customer_status 
ON orders (customer_id, status);

-- Point lookup (fast)
SELECT * FROM orders 
WHERE customer_id = 123 AND status = 'shipped';

-- Partial point lookup (filters on customer_id, scans status)
SELECT * FROM orders 
WHERE customer_id = 123;

-- Full scan (slower)
SELECT * FROM orders 
WHERE status = 'shipped';

Expression Matching

For expression indexes, queries must match exactly:

-- Index on expression
CREATE INDEX idx_orders_total ON orders (quantity * price);

-- Point lookup (fast) - exact match
SELECT * FROM orders WHERE quantity * price = 100;

-- Full scan (slower) - different expression
SELECT * FROM orders WHERE price * quantity = 100;

Index Scans

Materialize performs full index scans when:

Not all indexed fields are specified
Using non-equality conditions (>, <, LIKE)
Using OR between different fields
Type mismatch in equality conditions

CREATE INDEX idx_orders_qty_price ON orders (quantity, price);

-- Full scan - only one indexed field specified
SELECT * FROM orders WHERE quantity = 10;

-- Full scan - range query
SELECT * FROM orders WHERE quantity > 10 AND price > 5;

-- Full scan - OR between different fields  
SELECT * FROM orders WHERE quantity = 10 OR price = 5;

Verifying Index Usage

Use EXPLAIN to verify how queries use indexes:

EXPLAIN SELECT * FROM orders WHERE customer_id = 123;

Point lookup output:

Explained Query (fast path):
  ReadIndex on=orders idx_orders_customer=[lookup value=(123)]

Used Indexes:
  - idx_orders_customer (lookup)

Look for:

lookup value= indicates point lookup (fastest)
(lookup) in Used Indexes section
fast path means query executes outside regular dataflow

Full scan output:

Used Indexes:
  - idx_orders_customer (scan)

Index Maintenance

Monitoring Index Memory Usage

Check memory consumption by index:

SELECT 
  i.name AS index_name,
  c.name AS cluster_name,
  SUM(s.size_bytes) / (1024.0 * 1024.0 * 1024.0) AS size_gb
FROM mz_indexes i
JOIN mz_clusters c ON i.cluster_id = c.id
JOIN mz_internal.mz_storage_usage s ON i.id = s.object_id
GROUP BY i.name, c.name
ORDER BY size_gb DESC;

Viewing All Indexes

List indexes with their configuration:

SELECT 
  i.name AS index_name,
  c.name AS cluster_name,
  obj.name AS on_object,
  i.create_sql
FROM mz_indexes i
JOIN mz_clusters c ON i.cluster_id = c.id
JOIN mz_objects obj ON i.on_id = obj.id
ORDER BY c.name, i.name;

Viewing Index Keys

Inspect index column definitions:

SELECT 
  i.name AS index_name,
  ic.index_position,
  ic.on_position,
  ic.on_expression,
  ic.nullable
FROM mz_index_columns ic
JOIN mz_indexes i ON ic.index_id = i.id
WHERE i.name = 'idx_orders_customer'
ORDER BY ic.index_position;

Dropping Unused Indexes

Remove indexes to free memory:

DROP INDEX idx_orders_customer;

Important: Dropping an index that other queries depend on will force those queries to recompute results or use alternative indexes.

Rebuilding Indexes

To rebuild an index with a different configuration:

-- Drop existing index
DROP INDEX idx_orders_old;

-- Create new index with updated keys
CREATE INDEX idx_orders_new ON orders (new_column);

Optimization Strategies

Point Lookup Optimization

Design indexes for equality-based queries:

-- Query pattern
SELECT * FROM orders WHERE customer_id = ? AND status = ?;

-- Optimized index (either column order works)
CREATE INDEX idx_orders_customer_status 
ON orders (customer_id, status);

-- OR
CREATE INDEX idx_orders_status_customer 
ON orders (status, customer_id);

Join Optimization

Index join keys to accelerate joins:

CREATE VIEW customer_orders AS
  SELECT c.name, o.total
  FROM customers c
  JOIN orders o ON c.id = o.customer_id;

-- Index the join key on both sides
CREATE INDEX idx_customers_id ON customers (id);
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- Now index the joined view for fast queries
CREATE INDEX idx_customer_orders ON customer_orders (name);

Shared Indexes

Multiple queries can share the same index:

-- Index shared by multiple queries
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Query 1 uses index
SELECT * FROM orders WHERE customer_id = 123;

-- Query 2 also uses same index
SELECT COUNT(*) FROM orders WHERE customer_id = 456;

Benefit: Fixed upfront memory cost with savings for each additional query.

Memory Management

Understanding Memory Footprint

Index memory usage is proportional to:

Current size of the indexed data
Number of columns stored
Data types (complex types use more memory)
Compaction rate

Estimating Index Size

Rough estimation:

Index Size ≈ Source Data Size × (1 + Storage Overhead)

Storage overhead typically ranges from 20-50% depending on data types and compaction.

Memory Best Practices

Index selectively: Don’t index everything—only what queries actually use
Use narrow indexes: Include only necessary columns in keys
Share indexes: Design indexes that multiple queries can use
Monitor usage: Regularly review which indexes are actually used
Consider materialized views: For cross-cluster access, use materialized views instead

Identifying Large Indexes

SELECT 
  i.name,
  c.name AS cluster,
  m.memory_bytes / (1024.0 * 1024.0 * 1024.0) AS memory_gb
FROM mz_indexes i
JOIN mz_clusters c ON i.cluster_id = c.id
JOIN mz_internal.mz_cluster_replica_metrics m 
  ON c.id = (SELECT cluster_id FROM mz_cluster_replicas WHERE id = m.replica_id)
GROUP BY i.name, c.name, m.memory_bytes
ORDER BY memory_gb DESC
LIMIT 10;

Best Practices

Index Design

Match query patterns: Index columns that appear in WHERE clauses
Exact expression matching: Ensure indexed expressions match query expressions exactly
Consider cardinality: High-cardinality columns make better index keys
Order matters for partial lookups: First columns in multi-column indexes are most important

Operational Guidelines

Start minimal: Create indexes only when performance testing reveals the need
Use EXPLAIN: Always verify queries use indexes as expected
Monitor memory: Track index memory usage relative to cluster capacity
Cluster sizing: Ensure clusters have sufficient memory for all indexes
Document purpose: Maintain documentation about why each index exists

Development Workflow

Develop without indexes: Write queries first
Identify slow queries: Use query profiling to find bottlenecks
Create targeted indexes: Add indexes for specific performance problems
Verify improvement: Measure query performance before and after
Clean up: Remove indexes that don’t provide measurable benefit

Troubleshooting

Query Not Using Index

Symptoms: Slow query despite having an index Diagnosis:

EXPLAIN SELECT * FROM orders WHERE customer_id = 123;

Common causes:

Query runs in different cluster than index
Index keys don’t match query conditions
Type mismatch in WHERE clause
Query uses OR between different fields

Solution: Verify cluster, check index keys match exactly, adjust query or index.

High Memory Usage

Symptoms: Cluster running out of memory Diagnosis:

SELECT 
  COUNT(*) AS index_count,
  SUM(size_bytes) / (1024.0 * 1024.0 * 1024.0) AS total_gb
FROM mz_internal.mz_storage_usage
WHERE object_id IN (SELECT id FROM mz_indexes);

Solution: Drop unused indexes, increase cluster size, or use materialized views for shared data.

Slow Index Creation

Symptoms: CREATE INDEX takes a long time Cause: Initial hydration computing all historical results Solution:

Create indexes during low-traffic periods
Ensure sufficient cluster resources
Consider creating indexes in stages for very large datasets

Advanced Topics

Indexes and Arrangements

Internally, indexes use arrangements—Materialize’s optimized data structures. The system automatically:

Shares arrangements between operators
Compacts data to reduce memory
Maintains delta updates for efficiency

Index Ordering Limitations

Materialize does not use index key order for:

ORDER BY clauses
Range queries
Sorted merges

Indexes are optimized for equality lookups and full scans, not ordered access.

Get Started

Core Concepts

Ingesting Data

Transform Data

Serve Results

Operations

Integrations

​Overview

​Understanding Indexes

​What Indexes Store

​Indexes vs. Materialized Views

​When to Create Indexes

​On Sources

​On Views

​On Materialized Views

​Creating Indexes

​Basic Index Creation

​Multi-Column Indexes

​Expression Indexes

​Default Indexes

​Cluster-Specific Indexes

​Index Usage Patterns

​Point Lookups

​Multi-Column Point Lookups

​Expression Matching

​Index Scans

​Verifying Index Usage

​Index Maintenance

​Monitoring Index Memory Usage

​Viewing All Indexes

​Viewing Index Keys

​Dropping Unused Indexes

​Rebuilding Indexes

​Optimization Strategies

​Point Lookup Optimization

​Join Optimization

​Shared Indexes

​Memory Management

​Understanding Memory Footprint

​Estimating Index Size

​Memory Best Practices

​Identifying Large Indexes

​Best Practices

​Index Design

​Operational Guidelines

​Development Workflow

​Troubleshooting

​Query Not Using Index

​High Memory Usage

​Slow Index Creation

​Advanced Topics

​Indexes and Arrangements

​Index Ordering Limitations

​Related Topics

Build docs developers (and LLMs) love