Available Data Sources
Materialize supports ingesting data from the following external systems:
PostgreSQL
Stream data from PostgreSQL databases using Change Data Capture (CDC)
MySQL
Stream data from MySQL databases using binlog replication
SQL Server
Stream data from SQL Server databases using Change Data Capture (CDC)
Kafka
Consume messages from Kafka and Redpanda topics
Webhooks
Accept HTTP POST requests from webhook providers
Core Concepts
Sources and Clusters
Sources in Materialize require a cluster to provide the compute resources needed to ingest data.
Connections
Connections describe how to connect and authenticate to external systems. Once created, a connection is reusable across multiple CREATE SOURCE statements.
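As a sketch, a reusable PostgreSQL connection might look like this (the host, database, and secret names are illustrative):

```sql
-- Store the upstream password as a secret, then reference it from the connection.
CREATE SECRET pg_password AS '<password>';

CREATE CONNECTION pg_conn TO POSTGRES (
    HOST 'db.example.com',
    PORT 5432,
    USER 'materialize',
    PASSWORD SECRET pg_password,
    DATABASE 'shop'
);
```

Any number of subsequent CREATE SOURCE statements can then reference pg_conn without repeating credentials.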
Data Ingestion Lifecycle
1. Snapshotting
When a new source is created, Materialize performs a sync of all data available in the external system before it starts ingesting new data — an operation known as snapshotting. Key considerations:
- The duration depends on the volume of data and the size of the cluster
- Run source creation during off-peak hours when possible
- Limit the volume of data that is synced into Materialize
- For upsert sources, consider using a larger cluster during snapshotting
2. Running/Steady-State
Once snapshotting completes, Materialize transitions to running state and continually ingests changes from the upstream system in real time.
3. Hydration
When a cluster is restarted (such as after resizing), objects on that cluster undergo hydration — the reconstruction of in-memory state by reading data from Materialize’s storage layer. This does not require reading from the upstream system.
Best Practices
Scheduling
For production deployments:
- Create sources during off-peak hours to minimize operational risk
- Plan for the initial snapshotting duration based on data volume
- Monitor CPU and memory utilization during snapshotting
Resource Management
Dedicate a cluster for sources:
- Start with a larger cluster for snapshotting (especially for upsert sources)
- Downsize to align with steady-state resource needs after snapshotting completes
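For example, a dedicated ingestion cluster and a source placed in it might look like this (the cluster name, size, connection, and publication name are illustrative, and assume a pg_conn connection already exists):

```sql
-- Dedicated cluster for sources, sized generously for the initial snapshot.
CREATE CLUSTER ingest_cluster (SIZE = '400cc');

-- Create the source in that cluster rather than a shared compute cluster.
CREATE SOURCE pg_source
  IN CLUSTER ingest_cluster
  FROM POSTGRES CONNECTION pg_conn (PUBLICATION 'mz_source')
  FOR ALL TABLES;
```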
Limit Data Volume
Ingest only the data you need:
- For PostgreSQL: Create publications with specific tables instead of all tables
- For MySQL: Select specific schemas or tables
- For Kafka: Use appropriate topic filtering
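In the PostgreSQL case, for instance, the publication is created upstream in the PostgreSQL database itself; a sketch with illustrative table names:

```sql
-- Run in PostgreSQL, not Materialize: publish only the tables Materialize needs.
CREATE PUBLICATION mz_source FOR TABLE orders, customers;
```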
Network Security
Materialize supports multiple network security options:
AWS PrivateLink (Cloud Only)
Securely connect to resources in your AWS VPC without exposing them to the public internet.
SSH Tunnel
Connect through an SSH bastion host.
Static IP Allowlist
Allow connections from Materialize’s static egress IP addresses.
Monitoring Data Ingestion
Check Source Status
- running: Source is actively ingesting data
- starting: Source is initializing
- paused: Cluster has 0 replicas
- stalled or failed: Configuration issue (check the error field)
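Statuses can be queried from the system catalog, for example:

```sql
SELECT name, status, error
FROM mz_internal.mz_source_statuses;
```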
Monitor Ingestion Progress
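One way to watch ingestion progress is via the per-source statistics in the system catalog (the column subset shown here is illustrative):

```sql
SELECT s.name,
       ss.snapshot_committed,   -- true once the initial snapshot is durable
       ss.messages_received,
       ss.updates_committed
FROM mz_internal.mz_source_statistics ss
JOIN mz_sources s ON s.id = ss.id;
```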
Troubleshooting
Source Not Ingesting Data
- Check the source status in the console or via mz_source_statuses
- Verify the cluster has at least 1 replica
- Check for configuration errors in the error message
- Ensure network connectivity to the upstream system
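To confirm the source’s cluster actually has replicas, a catalog query along these lines works (the cluster name is illustrative):

```sql
SELECT c.name AS cluster, r.name AS replica, r.size
FROM mz_clusters c
LEFT JOIN mz_cluster_replicas r ON r.cluster_id = c.id
WHERE c.name = 'ingest_cluster';
```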
Slow Snapshotting
- Scale up the cluster size temporarily:
- After snapshotting completes, scale back down:
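Both steps are ALTER CLUSTER statements; the cluster name and sizes here are illustrative:

```sql
-- Scale up for the snapshot...
ALTER CLUSTER ingest_cluster SET (SIZE = '800cc');

-- ...then return to steady-state sizing once the snapshot is committed.
ALTER CLUSTER ingest_cluster SET (SIZE = '200cc');
```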
Memory Issues with Upsert Sources
Upsert sources (including Debezium-formatted sources) can be memory-intensive:
- Use standard-sized clusters that automatically spill to disk
- Start with a larger cluster size for snapshotting
- Monitor memory utilization during steady-state
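For reference, an upsert source is one declared with ENVELOPE UPSERT (or ENVELOPE DEBEZIUM); a minimal Kafka sketch with illustrative connection, cluster, and topic names:

```sql
CREATE SOURCE kafka_upsert
  IN CLUSTER ingest_cluster
  FROM KAFKA CONNECTION kafka_conn (TOPIC 'events')
  KEY FORMAT TEXT
  VALUE FORMAT JSON
  ENVELOPE UPSERT;
```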
Next Steps
PostgreSQL CDC
Set up Change Data Capture from PostgreSQL
MySQL CDC
Configure binlog replication for MySQL
Kafka Streaming
Consume Kafka topics with various formats
Webhook Sources
Create HTTP endpoints for webhook data