Metadb continuously synchronizes data from external data sources, which could be transaction-processing databases, sensor networks, or other streaming systems. The platform is designed to support multiple data sources and keeps the database updated based on state changes in these external systems.

How Data Sources Work

Data sources in Metadb stream changes continuously to keep your analytics database synchronized with the source systems. When data changes in the source, those changes flow through to Metadb, which updates its tables accordingly.
Metadb extends PostgreSQL with streaming data source capabilities, allowing you to build analytics on top of continuously updating data without manual ETL processes.
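The update flow can be illustrated with a minimal sketch (plain Python, not Metadb code; the event format and apply_change helper are hypothetical): each change event from a source carries an operation and a row, and the target table is updated to reflect the latest state.

```python
# Minimal illustration of how streamed change events keep a target
# table in sync with a source. The event format is hypothetical and
# not the actual Metadb wire protocol.

def apply_change(table, event):
    """Apply one change event to an in-memory table keyed by primary key."""
    op, key, row = event["op"], event["key"], event.get("row")
    if op in ("insert", "update"):
        table[key] = row          # upsert the new row state
    elif op == "delete":
        table.pop(key, None)      # remove the row if present
    return table

table = {}
events = [
    {"op": "insert", "key": 1, "row": {"temp": 21.5}},
    {"op": "update", "key": 1, "row": {"temp": 22.0}},
    {"op": "insert", "key": 2, "row": {"temp": 19.8}},
    {"op": "delete", "key": 2},
]
for e in events:
    apply_change(table, e)

print(table)  # {1: {'temp': 22.0}}
```

The end result is that the table always reflects the most recent state of each source row, which is what removes the need for manual ETL.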

Supported Source Types

Currently, Metadb supports Kafka as a data source type, which enables integration with systems that use change data capture (CDC) to stream database changes.

Kafka Data Sources

Kafka data sources connect to Kafka brokers and consume messages from specified topics. You can configure:
  • Brokers: Bootstrap servers for the Kafka cluster
  • Topics: Regular expressions matching topics to read
  • Consumer Groups: Kafka consumer group ID for offset management
  • Security: SSL or plaintext protocols
  • Filters: Schema and table filtering rules

Creating a Data Source

To create a new data source, use the create data source command:
create data source sensor type kafka options (
    brokers 'kafka:29092',
    topics '^metadb_sensor_1\.',
    consumer_group 'metadb_sensor_1_1',
    add_schema_prefix 'sensor_',
    table_stop_filter '^testing\.air_temp$,^testing\.air_temp_avg$'
);
1. Define the source name and type
   Choose a unique name for your data source and specify the type (currently kafka).

2. Configure connection options
   Provide broker addresses, topics, consumer group, and security settings.

3. Set up filtering rules
   Use schema_pass_filter, schema_stop_filter, and table_stop_filter to control which tables are synchronized.

4. Wait for synchronization
   The source starts in synchronizing mode. After the initial snapshot completes, run metadb endsync to finish setup.

Data Origin Tracking

The __origin column in Metadb tables allows you to track where data came from. This is especially useful when combining data from multiple sources into a single table.
select __id, __start, __origin, id, groupname, description 
    from library.patrongroup;
 __id | __start                | __origin | id | groupname | description
------+------------------------+----------+----+-----------+-----------------------
    8 | 2022-04-18 19:27:18-00 | west     | 15 | undergrad | Undergraduate Student
    4 | 2022-04-17 17:42:25-00 | east     | 10 | graduate  | Graduate Student
Origins allow grouping data independently of data sources. While data sources may be dictated by how data is collected (e.g., geographically in a sensor network), origins provide logical grouping based on your application needs.
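As a sketch of how __origin supports logical grouping, the following uses SQLite to stand in for the Metadb/PostgreSQL database (the table and rows are hypothetical, modeled on the example above):

```python
import sqlite3

# Stand-in for a Metadb table that combines records from two origins.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE patrongroup (
        __id INTEGER, __origin TEXT, id INTEGER,
        groupname TEXT, description TEXT
    )
""")
conn.executemany(
    "INSERT INTO patrongroup VALUES (?, ?, ?, ?, ?)",
    [
        (8, "west", 15, "undergrad", "Undergraduate Student"),
        (4, "east", 10, "graduate", "Graduate Student"),
        (9, "west", 16, "staff", "Library Staff"),
    ],
)

# Group records by origin, independently of which data source produced them.
rows = conn.execute(
    "SELECT __origin, COUNT(*) FROM patrongroup "
    "GROUP BY __origin ORDER BY __origin"
).fetchall()
print(rows)  # [('east', 1), ('west', 2)]
```

The same GROUP BY pattern applies directly in Metadb, since its tables are ordinary PostgreSQL tables with __origin as a regular column.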

Managing Data Sources

Modifying a Data Source

Change data source settings using alter data source:
alter data source sensor options (
    set consumer_group 'metadb_sensor_1'
);
Changes to data sources currently require restarting the Metadb server to take effect.

Removing a Data Source

Remove a data source configuration:
drop data source sensor;

Configuration Options

Schema and Table Filtering

Control which schemas and tables are synchronized:
create data source myapp type kafka options (
    brokers 'kafka:9092',
    topics '^myapp\.',
    consumer_group 'metadb_myapp',
    schema_pass_filter '^production\.',
    schema_stop_filter '^test\.',
    table_stop_filter '^temp_.*$'
);
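Assuming the filters are regular expressions matched against schema and table names (a reading of the options above, not a statement of Metadb internals), the pass/stop logic can be sketched as:

```python
import re

def schema_allowed(schema, pass_filters, stop_filters):
    """A schema is synchronized if it matches a pass filter (when any
    are configured) and does not match any stop filter."""
    if pass_filters and not any(re.search(p, schema) for p in pass_filters):
        return False
    return not any(re.search(s, schema) for s in stop_filters)

pass_filters = [r"^production\."]
stop_filters = [r"^test\."]

print(schema_allowed("production.orders", pass_filters, stop_filters))  # True
print(schema_allowed("test.orders", pass_filters, stop_filters))        # False
print(schema_allowed("staging.orders", pass_filters, stop_filters))     # False
```

Note that a pass filter makes synchronization opt-in: anything not matching it (such as staging.orders here) is excluded even without a stop filter.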

Schema Name Mapping

Modify schema names during synchronization:
  • trim_schema_prefix: Remove a prefix from schema names
  • add_schema_prefix: Add a prefix to schema names
  • map_public_schema: Map tables from the public schema to a different target schema
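A minimal sketch of the mapping behavior described above (the function, its argument names, and the order in which the options are applied are illustrative assumptions, not Metadb's implementation):

```python
def map_schema(name, trim_prefix=None, add_prefix=None, public_target=None):
    """Apply map_public_schema, trim_schema_prefix, and add_schema_prefix
    to a source schema name (an assumed ordering for illustration)."""
    if public_target is not None and name == "public":
        return public_target
    if trim_prefix and name.startswith(trim_prefix):
        name = name[len(trim_prefix):]
    if add_prefix:
        name = add_prefix + name
    return name

print(map_schema("metadb_testing", trim_prefix="metadb_", add_prefix="sensor_"))
# sensor_testing
print(map_schema("public", public_target="app"))
# app
```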

Monitoring Data Sources

Check the status of your data sources:
list data_sources;
View when tables were last updated:
select * from metadb.table_update 
    order by schema_name, table_name;
Check the system log for data source events:
select * from mdblog();

Best Practices

Use Consumer Groups Wisely

Each Metadb instance should use a unique consumer group ID to maintain independent offset tracking.

Filter Unnecessary Tables

Use table and schema filters to reduce unnecessary data synchronization and improve performance.

Plan for Initial Snapshots

Initial snapshots can take significant time. Monitor the logs and wait for the “snapshot complete” message before running endsync.

Track Origins for Multi-Source Setups

When combining data from multiple sources, use the __origin column to distinguish the source of each record.
