Introduction

Materialize is a real-time data integration platform that creates and continually updates consistent views of transactional data. Built on top of differential dataflow and timely dataflow, Materialize enables SQL-based stream processing with strong consistency guarantees. This section introduces the core architectural concepts that make Materialize unique:

Sources

Connect to external systems like Kafka, PostgreSQL, and MySQL

Materialized Views

Incrementally maintain query results in durable storage

Indexes

Store query results in memory for instant access

Sinks

Push data to external systems like Kafka

Clusters

Isolated compute resources for workload isolation
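
Assuming a Kafka connection named kafka_conn already exists, a minimal sketch of how these five concepts fit together (the object names, cluster size, and JSON field are illustrative):

-- Cluster: isolated compute for this workload (size value is illustrative)
CREATE CLUSTER analytics (SIZE = '50cc');

-- Source: continuously ingest a Kafka topic as JSON
CREATE SOURCE orders
IN CLUSTER analytics
FROM KAFKA CONNECTION kafka_conn (TOPIC 'orders')
FORMAT JSON;

-- Materialized view: incrementally maintained, durably stored
-- (a JSON source exposes its payload as a jsonb column named data)
CREATE MATERIALIZED VIEW order_counts AS
SELECT data->>'customer_id' AS customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY 1;

-- Index: keep the results in memory for fast reads
CREATE INDEX ON order_counts (customer_id);

-- Sink: push the maintained results back out to Kafka
CREATE SINK order_counts_sink
FROM order_counts
INTO KAFKA CONNECTION kafka_conn (TOPIC 'order-counts')
FORMAT JSON
ENVELOPE DEBEZIUM;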

Architecture Overview

Materialize’s architecture consists of three main layers:
SQL Front-end  <─┐
SQL Front-end  <─┼──>  Coordinator       <──>  Dataflow Layer
SQL Front-end  <─┘     • Optimizer             • Sources
                       • Catalog               • Sinks
                       • Timestamp             • Compute

Key Architectural Components

Coordinator

The coordinator is the “brains” of Materialize. It manages:
  • Metadata and catalog information
  • Query execution and optimization
  • Timestamp selection for linearizability
  • Frontier tracking for data completeness

Storage

The storage layer is responsible for:
  • Data persistence
  • Sources and sinks
  • Reclocking for timestamp alignment
  • Change data capture ingestion

Compute

The compute layer handles:
  • Dataflow execution using differential dataflow
  • Horizontal scaling across replicas
  • Incremental view maintenance
  • Active replication for fault tolerance

Differential Dataflow

At its core, Materialize uses differential dataflow to process data as streams of updates. Each update is represented as a triple:
(data, time, diff)
  • data: The actual row data
  • time: A logical timestamp (e.g., transaction ID, milliseconds since epoch)
  • diff: An integer representing the change (+1 for insert, -1 for delete)
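
You can observe these triples directly with SUBSCRIBE, which emits each update alongside its timestamp and diff (order_counts here is a hypothetical view):

-- Each output row carries mz_timestamp (time) and mz_diff (diff)
SUBSCRIBE (SELECT * FROM order_counts);
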
This representation enables:
  1. Incremental Computation: Only the changes are processed, not entire datasets
  2. Multiversion State: Historical data access at any timestamp
  3. Efficient Joins: Delta joins avoid intermediate state blowup
  4. Consistency: All operations maintain transactional consistency

Reaction Time

Materialize optimizes for reaction time — the total delay from data change to queryable result:
Reaction Time = Freshness + Query Latency

OLTP Systems

High reaction time:
  • Excellent freshness
  • Poor query latency for analytics

Data Warehouses

High reaction time:
  • Poor freshness (batch ingestion)
  • Excellent query latency

Materialize

Low reaction time:
  • Excellent freshness (streaming)
  • Excellent query latency (incremental)

Consistency Guarantees

Materialize provides linearizability — the strongest consistency guarantee possible. This means:
  • Query results reflect a consistent snapshot across all data sources
  • Operations within a transaction maintain atomicity
  • Cross-source joins produce consistent results
  • No eventual consistency or approximate answers

How It Works

The coordinator determines timestamps for queries by tracking:
  1. Lower bounds: The earliest timestamps for which valid data is still available
  2. Upper bounds: The timestamps up to which data is known to be complete
  3. Logical compaction frontiers: The oldest timestamps that remain queryable after historical detail is consolidated
Every query executes at a specific timestamp, ensuring consistent reads even when data is constantly changing.
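
You can inspect this timestamp selection for any query with EXPLAIN TIMESTAMP, which reports the chosen query timestamp along with each input's read and write frontiers (the table name is hypothetical):

EXPLAIN TIMESTAMP FOR SELECT * FROM orders;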

Data Flow Model

Data flows through Materialize in this sequence:
  1. Ingestion: Sources continuously read from external systems (Kafka, PostgreSQL, MySQL)
  2. Transformation: Materialized views apply SQL transformations incrementally
  3. Storage: Results persist in durable storage with incremental updates
  4. Indexing: Indexes load results into memory for fast queries
  5. Serving: Queries read from indexes, or sinks push results to external systems

Common Use Cases

Materialize excels at three primary patterns:

Query Offload (CQRS)

Scale complex read queries more efficiently than read replicas:
-- Create a materialized view for complex aggregations
CREATE MATERIALIZED VIEW customer_metrics AS
SELECT 
    customer_id,
    COUNT(*) as order_count,
    SUM(total_amount) as lifetime_value,
    AVG(total_amount) as avg_order_value
FROM orders
GROUP BY customer_id;

-- Index for fast lookups
CREATE INDEX idx_customer ON customer_metrics (customer_id);
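
With the index in place, point lookups are served directly from memory (the customer_id value is illustrative):

SELECT * FROM customer_metrics WHERE customer_id = 42;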

Integration Hub (ODS)

Combine data from multiple sources into unified views:
-- Join across PostgreSQL and Kafka sources
CREATE MATERIALIZED VIEW enriched_events AS
SELECT 
    e.event_id,
    e.user_id,
    u.email,
    u.segment,
    e.event_type,
    e.timestamp
FROM kafka_events e
JOIN postgres_users u ON e.user_id = u.id;
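
Downstream consumers can query the unified view directly, or follow its changes as they happen (the segment value is illustrative):

-- Stream ongoing changes for one customer segment
SUBSCRIBE (SELECT * FROM enriched_events WHERE segment = 'enterprise');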

Operational Data Mesh (ODM)

Create real-time data products for downstream consumers:
-- Maintain real-time inventory levels
CREATE MATERIALIZED VIEW inventory_status AS
SELECT 
    product_id,
    SUM(CASE WHEN type = 'purchase' THEN quantity ELSE 0 END) -
    SUM(CASE WHEN type = 'sale' THEN quantity ELSE 0 END) as current_stock
FROM inventory_events
GROUP BY product_id;

-- Stream to Kafka for downstream services
CREATE SINK inventory_sink
FROM inventory_status
INTO KAFKA CONNECTION kafka_conn (TOPIC 'inventory-status')
FORMAT JSON
ENVELOPE DEBEZIUM;

Next Steps

Dive deeper into specific concepts:

Learn About Sources

Discover how to connect Materialize to your data sources

Explore Materialized Views

Understand incremental view maintenance

Understand Indexes

Learn about in-memory query acceleration

Configure Clusters

Optimize compute resources and isolation