Overview
Materialize has six fundamental concepts that work together to provide real-time data transformation:Sources
External systems from which Materialize reads data
Views
Named SQL queries for data transformation
Materialized Views
Views with results persisted in durable storage
Indexes
Query results stored in memory for fast access
Clusters
Isolated pools of compute resources
Sinks
Connections to push data to external systems
Sources
Sources describe external systems you want Materialize to read data from and provide details about how to decode and interpret that data.Mental Model
Think of sources as a combination of:- Tables — they’re structured components you can read from
- Clients — they’re responsible for reading data from external systems
Source Components
Every source consists of three components:| Component | Purpose | Example |
|---|---|---|
| Connector | Provides the actual bytes of data to Materialize | Kafka, PostgreSQL, MySQL |
| Format | Defines how to structure the external data (schema) | Avro, JSON, Protobuf, CSV |
| Envelope | Specifies how Materialize should handle incoming data | Upsert, Debezium, Append-only |
Supported Connectors
Materialize provides native connectors for:PostgreSQL
Read data via logical replication (CDC)
MySQL
Read data via binary log replication (CDC)
Kafka & Redpanda
Read from Kafka topics in real-time
SQL Server
Read data via change data capture (CDC)
Webhooks
Ingest data from SaaS applications via HTTP
Load Generators
Generate sample data for testing
Creating a Source
- PostgreSQL
- MySQL
- Kafka
- Webhook
Subsources
When you create a source from a system with multiple tables (like PostgreSQL or MySQL), Materialize creates subsources — one for each upstream table.- Write-only by the source (only the source can write to them)
- Readable by users (you can query them like tables)
Sources and Clusters
Sources require compute resources, so they must be associated with a cluster.Best Practice: In production, dedicate a separate cluster just for sources. This isolates ingestion workloads from transformation and serving workloads.
Views
Views represent queries saved under a name for reference. They provide a shorthand for underlying SQL queries.How Views Work
- The query is not executed during view creation
- Results are recomputed from scratch each time you query the view
- Views are not associated with a cluster and can be referenced anywhere
Indexed Views
In Materialize, you can create an index on a view to keep results incrementally updated in memory:- The query is executed and results are stored in memory
- As new data arrives, results are incrementally updated
- Queries read from memory — making them fast and computationally free
Indexes are local to a cluster. An indexed view’s results are only available in the cluster where the index was created.
Materialized Views
Materialized views are views whose results are persisted in durable storage and incrementally updated as new data arrives.Creating Materialized Views
- Query is executed during creation
- Results are stored in durable storage (not just memory)
- Results are incrementally updated as data changes
- Can be queried from any cluster (cross-cluster access)
Hydration
Materialized views undergo hydration when:- The view is first created
- Its cluster is restarted
Indexes on Materialized Views
You can index materialized views to make results available in memory within a specific cluster:- Require no additional computation to stay current
- Simply make results available in memory for faster access
- Are useful in serving clusters that need low-latency access
Querying a materialized view is computationally free from any cluster (indexed or not). However, querying an indexed materialized view from the cluster with the index is faster because results are served from memory instead of storage.
Indexes
Indexes represent query results stored in memory within a cluster. They enable fast, computationally-free queries by maintaining incrementally updated results.How Indexes Work
When you create an index:- The underlying query is executed
- Results are stored in memory within the cluster
- As new data arrives, results are incrementally updated
- Queries read from memory — making them instant and free
Creating Indexes
- On Views
- On Materialized Views
- On Sources
How Materialize Uses Indexes
Unlike traditional databases, Materialize can use an index to serve queries even withoutWHERE conditions on indexed columns.
Consider this index:
| Query | Index Usage |
|---|---|
SELECT * FROM orders_view | Full scan of the index |
SELECT * FROM orders_view WHERE status = 'shipped' | Full scan + filter |
SELECT * FROM orders_view WHERE customer_id = 123 | Point lookup (fastest) |
SELECT * FROM orders_view WHERE customer_id IN (123, 456) | Point lookup |
Point Lookups
Point lookups are the most efficient index usage. Materialize performs a point lookup when:- The
WHEREclause uses equality (=orIN) on all indexed columns - Only
ANDis used to combine conditions (noOR) - The condition uses the exact expression type from the index
Multi-Column Indexes
For point lookups on multi-column indexes, you must specify all columns:Indexes and Query Optimization
Indexes help optimize:- Sequential access — faster than scanning unindexed data
- Random access — fast point lookups for specific keys
- Join operations — delta joins use indexes for efficiency
- Ad-hoc queries — even queries not optimized for the index benefit
Clusters
Clusters are isolated pools of compute resources (CPU, memory, and scratch disk space) for running your workloads.What Requires a Cluster?
The following operations need compute resources and must be associated with a cluster:- Maintaining sources and sinks
- Maintaining indexes and materialized views
- Executing
SELECTandSUBSCRIBEstatements
Resource Isolation
Clusters provide complete resource isolation:- Each cluster has dedicated CPU, memory, and disk
- Workloads on different clusters can’t affect each other
- Clusters can fail independently
Cluster Sizing
Cluster size determines:- Resource allocation (CPU, memory, disk)
- Processing speed
- Cost
25cc, 50cc, 100cc, 200cc, 400cc, etc.
Replication Factor
The replication factor determines fault tolerance:- Each replica does the same work on the same data
- Provides high availability if one replica fails
- Increases cost proportionally (2 replicas = 2× cost)
Materialize automatically spreads replicas across availability zones for maximum fault tolerance.
Three-Tier Architecture (Production Best Practice)
For production workloads, use separate clusters:Compute/Transformation Cluster(s)
Dedicated to maintaining materialized views and complex transformations.
- Isolation — ingestion issues don’t affect query serving
- Scalability — size each tier independently
- Availability — replicate serving clusters for high availability
Sinks
Sinks are the inverse of sources — they represent connections to external systems where Materialize outputs data.What Can You Sink?
You can create sinks from:- Materialized views
- Sources
- Tables
Supported Sink Types
- Kafka
- SUBSCRIBE
Stream changes to a Kafka topic:
Sinks and Clusters
Best Practice: Avoid putting sinks on the same cluster that hosts sources. Separate ingestion and egress workloads for better isolation.
Hydration Considerations
During creation, Kafka sinks need to load an entire snapshot of the data in memory. Ensure your cluster has sufficient memory.Putting It All Together
Here’s a complete example showing how all concepts work together:Decision Guide: Views, Materialized Views, and Indexes
Use this table to decide which approach fits your use case:
| Use Case | Recommended Approach |
|---|---|
| Intermediate transformation (not queried directly) | View (no index) |
| Results needed in one cluster only | Indexed view |
| Results shared across multiple clusters | Materialized view |
Final output for SUBSCRIBE consumers | Materialized view |
| Fast serving in high-availability cluster | Materialized view + Index in serving cluster |
| Point lookups on specific columns | Index with those columns as keys |
| Low-traffic, ad-hoc queries | View (no index) |
| High-traffic, repeated queries | Indexed view or Materialized view |
Best Practices Summary
Cluster Architecture
Cluster Architecture
- Use three-tier architecture in production (ingestion, compute, serving)
- Isolate sources and sinks on dedicated clusters
- Size clusters based on actual workload requirements
- Use replication for serving clusters that need high availability
Indexes
Indexes
- Create indexes on views that are frequently queried
- Index the columns used in
WHEREclauses for point lookups - Remember indexes are local to a cluster
- Monitor memory usage — indexes live in memory
Materialized Views
Materialized Views
- Use for results that must be shared across clusters
- Use for
SUBSCRIBEoperations - Consider hydration time when creating or restarting
- Index materialized views in serving clusters for low latency
Sources
Sources
- Dedicate a cluster for sources if possible
- Use upsert sources on larger clusters (they maintain state)
- Monitor ingestion lag and throughput
- Consider separate clusters for high-volume sources
Next Steps
Quickstart
Try these concepts hands-on with a complete example
SQL Reference
Explore the complete SQL syntax reference
Optimization Guide
Learn how to optimize query performance
Production Checklist
Best practices for running Materialize in production