Indexes in Materialize represent query results stored in memory within a cluster. Unlike traditional database indexes that simply speed up lookups, Materialize indexes store the complete, up-to-date results of queries and maintain them incrementally as data changes.
-- Create a viewCREATE VIEW recent_orders ASSELECT * FROM ordersWHERE order_time > NOW() - INTERVAL '1 hour';-- Index the view in memoryCREATE INDEX idx_recent_orders ON recent_orders(customer_id);
Indexes make queries computationally free — results are pre-computed and served directly from memory with no query-time computation.
Storage: In-memory (RAM)Scope: Local to a clusterSpeed: Microsecond query latencyDurability: Lost on cluster restartBest for: Fast reads within a cluster
Materialized Views
Storage: Durable storage (S3/disk)Scope: Accessible from any clusterSpeed: Millisecond to sub-second latencyDurability: Survives restartsBest for: Cross-cluster sharing
-- Scenario 1: Single cluster, fast queriesCREATE VIEW customer_summary ASSELECT customer_id, COUNT(*) as ordersFROM orders GROUP BY customer_id;CREATE INDEX idx_summary ON customer_summary(customer_id);-- Result: Fast queries in current cluster only-- Scenario 2: Multiple clusters need same dataCREATE MATERIALIZED VIEW customer_summary ASSELECT customer_id, COUNT(*) as ordersFROM orders GROUP BY customer_id;CREATE INDEX idx_cluster1 IN CLUSTER serving1ON customer_summary(customer_id);CREATE INDEX idx_cluster2 IN CLUSTER serving2ON customer_summary(customer_id);-- Result: Both clusters can query fast from memory
Rule of thumb: Use indexed views for single-cluster workloads. Use materialized views + indexes for multi-cluster architectures or when you need durability.
Unlike traditional databases, Materialize can use indexes even without WHERE clause filters:
CREATE INDEX idx_orders ON orders_view(status);
This index accelerates all of these queries:
-- Full scan (uses index)SELECT * FROM orders_view;-- Filter on indexed column (point lookup)SELECT * FROM orders_view WHERE status = 'shipped';-- Filter on non-indexed column (scans index)SELECT * FROM orders_view WHERE amount > 100;-- Join (uses index for lookup)SELECT * FROM orders_view oJOIN customers c ON o.customer_id = c.idWHERE o.status = 'shipped';
CREATE VIEW order_summary ASSELECT customer_id, COUNT(*) as order_count, SUM(amount) as total_spentFROM ordersGROUP BY customer_id;-- Index in current clusterCREATE INDEX idx_summary ON order_summary(customer_id);-- Index in specific clusterCREATE INDEX idx_summary_servingIN CLUSTER serving_clusterON order_summary(customer_id);
CREATE MATERIALIZED VIEW customer_metrics ASSELECT customer_id, AVG(amount) as avg_order, MAX(amount) as max_orderFROM ordersGROUP BY customer_id;-- Index the materialized view for fast queriesCREATE INDEX idx_metricsIN CLUSTER serving1ON customer_metrics(customer_id);
Indexing a materialized view requires no additional computation — the results already exist in durable storage and are simply loaded into memory.
CREATE SOURCE kafka_eventsFROM KAFKA CONNECTION kafka_conn (TOPIC 'events')FORMAT JSON;-- Index the source directlyCREATE INDEX idx_events ON kafka_events(user_id);
Indexing sources directly is rare in practice. Usually you’ll create views that transform source data, then index those views.
-- Index on computed columnCREATE INDEX idx_monthly ON orders( DATE_TRUNC('month', order_time));-- Query must match exact expressionSELECT * FROM ordersWHERE DATE_TRUNC('month', order_time) = '2024-01-01';
-- Index on multiple expressionsCREATE INDEX idx_computed ON orders( LOWER(customer_name), ROUND(amount, 2));-- Point lookupSELECT * FROM ordersWHERE LOWER(customer_name) = 'alice' AND ROUND(amount, 2) = 49.99;
Type mismatches prevent point lookups:
CREATE INDEX idx ON orders(quantity); -- quantity is integerSELECT * FROM orders WHERE quantity = 5; -- Point lookup ✓SELECT * FROM orders WHERE quantity = 5.0; -- Index scan (float ≠ int)
Join operators require both inputs to be arranged (indexed):
CREATE VIEW enriched_orders ASSELECT o.order_id, c.customer_name, o.amountFROM orders oJOIN customers c ON o.customer_id = c.id;CREATE INDEX idx_enriched ON enriched_orders(order_id);
Materialize automatically creates intermediate arrangements for joins. You can query mz_arrangement_sizes to see them:
SELECT operator_name, pg_size_pretty(size) as memory_usedFROM mz_arrangement_sizesJOIN mz_dataflow_operators USING (operator_id)WHERE dataflow_id IN ( SELECT dataflow_id FROM mz_dataflows WHERE name = 'enriched_orders');
-- Index sizes by nameSELECT mz_indexes.name, pg_size_pretty(SUM(mz_arrangement_sizes.size)) as total_sizeFROM mz_indexesJOIN mz_arrangement_sizes ON mz_indexes.id = mz_arrangement_sizes.object_idGROUP BY mz_indexes.nameORDER BY SUM(mz_arrangement_sizes.size) DESC;-- Total memory by clusterSELECT mz_clusters.name, pg_size_pretty(SUM(mz_arrangement_sizes.size)) as memory_usedFROM mz_clustersJOIN mz_indexes ON mz_indexes.cluster_id = mz_clusters.idJOIN mz_arrangement_sizes ON mz_indexes.id = mz_arrangement_sizes.object_idGROUP BY mz_clusters.name;
Indexes are local to clusters and don’t transfer between them:
-- Create indexes in multiple clustersCREATE INDEX idx_serving1 IN CLUSTER serving1ON customer_summary(customer_id);CREATE INDEX idx_serving2 IN CLUSTER serving2 ON customer_summary(customer_id);-- Queries in serving1 use idx_serving1-- Queries in serving2 use idx_serving2-- Queries in other clusters don't use indexes
Three-tier architecture pattern:
Ingest cluster: Sources
Transform cluster: Materialized views (no indexes)
Serving clusters: Indexes on materialized views
This separates ingestion, transformation, and query serving workloads.
Materialize can answer some queries without indexes using dataflow operators:
-- No index exists on ordersSELECT COUNT(*) FROM orders WHERE status = 'pending';
Materialize:
Constructs a dataflow: Source → Filter → Count
Processes the entire dataset once
Returns results
With index:
CREATE VIEW pending_count ASSELECT COUNT(*) FROM orders WHERE status = 'pending';CREATE INDEX idx_pending ON pending_count();-- Query returns instantly from memorySELECT * FROM pending_count;
Index-free queries still execute efficiently but require compute time proportional to data size. Indexes make queries O(1) — constant time.
-- Index only filtered dataCREATE VIEW recent_orders ASSELECT * FROM ordersWHERE order_time > NOW() - INTERVAL '7 days';CREATE INDEX idx_recent ON recent_orders(customer_id);-- Uses less memory than indexing all orders
-- Both views share the same arrangementCREATE VIEW view1 ASSELECT customer_id, SUM(amount)FROM ordersGROUP BY customer_id;CREATE VIEW view2 ASSELECT customer_id, COUNT(*)FROM ordersGROUP BY customer_id;-- Single index serves bothCREATE INDEX idx_orders ON orders(customer_id);
-- Good: customer_id is selectiveCREATE INDEX idx ON orders(customer_id);-- Less useful: status has few distinct valuesCREATE INDEX idx ON orders(status);
Monitor Arrangement Sizes
Large arrangements indicate optimization opportunities:
SELECT name, pg_size_pretty(size)FROM mz_indexesJOIN mz_arrangement_sizes USING (id)WHERE size > 1024*1024*1024 -- > 1GBORDER BY size DESC;
Drop Unused Indexes
Indexes consume memory even when not queried:
DROP INDEX idx_unused;
Monitor query patterns and remove indexes that aren’t being used.
Use Default Indexes Sparingly
Creating a default index on all columns:
CREATE DEFAULT INDEX ON my_view;
Convenient but uses more memory than selective indexes. Prefer explicit column lists.
Build a real-time dashboard with optimized indexes:
-- Ingest orders from PostgreSQLCREATE SOURCE pg_sourceFROM POSTGRES CONNECTION pg_conn(PUBLICATION 'mz_source');-- Transform: Customer metricsCREATE VIEW customer_metrics ASSELECT customer_id, COUNT(*) as order_count, SUM(total) as lifetime_value, MAX(order_time) as last_orderFROM pg_source_ordersGROUP BY customer_id;-- Index for customer lookup pageCREATE INDEX idx_customer IN CLUSTER servingON customer_metrics(customer_id);-- Transform: Top productsCREATE VIEW top_products ASSELECT product_id, COUNT(*) as sales_count, SUM(quantity) as total_quantityFROM pg_source_order_itemsWHERE order_time > NOW() - INTERVAL '24 hours'GROUP BY product_idORDER BY sales_count DESCLIMIT 100;-- Index for dashboard main pageCREATE DEFAULT INDEX ON top_products;-- Transform: Real-time revenueCREATE VIEW revenue_by_hour ASSELECT DATE_TRUNC('hour', order_time) as hour, COUNT(*) as order_count, SUM(total) as revenueFROM pg_source_ordersWHERE order_time > NOW() - INTERVAL '7 days'GROUP BY hour;-- Index for time-series chartCREATE INDEX idx_revenue IN CLUSTER servingON revenue_by_hour(hour);