Metadb continuously synchronizes data from external data sources, which could be transaction-processing databases, sensor networks, or other streaming systems. The platform is designed to support multiple data sources and keeps the database updated based on state changes in these external systems.

How Data Sources Work

Data sources in Metadb stream changes continuously to keep your analytics database synchronized with the source systems. When data changes in the source, those changes flow through to Metadb, which updates its tables accordingly.
Metadb extends PostgreSQL with streaming data source capabilities, allowing you to build analytics on top of continuously updating data without manual ETL processes.
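The update flow can be illustrated with a minimal sketch (plain Python, not Metadb code; the event format and apply_change helper are hypothetical): each change event from a source carries an operation and a row, and the target table is updated to reflect the latest state.

```python
# Minimal illustration of how streamed change events keep a target
# table in sync with a source. The event format is hypothetical and
# not the actual Metadb wire protocol.

def apply_change(table, event):
    """Apply one change event to an in-memory table keyed by primary key."""
    op, key, row = event["op"], event["key"], event.get("row")
    if op in ("insert", "update"):
        table[key] = row          # upsert the new row state
    elif op == "delete":
        table.pop(key, None)      # remove the row if present
    return table

table = {}
events = [
    {"op": "insert", "key": 1, "row": {"temp": 21.5}},
    {"op": "update", "key": 1, "row": {"temp": 22.0}},
    {"op": "insert", "key": 2, "row": {"temp": 19.8}},
    {"op": "delete", "key": 2},
]
for e in events:
    apply_change(table, e)

print(table)  # {1: {'temp': 22.0}}
```

The end result is that the table always reflects the most recent state of each source row, which is what removes the need for manual ETL.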

Supported Source Types

Currently, Metadb supports Kafka as a data source type, which enables integration with systems that use change data capture (CDC) to stream database changes.

Kafka Data Sources

Kafka data sources connect to Kafka brokers and consume messages from specified topics. You can configure:
  • Brokers: Bootstrap servers for the Kafka cluster
  • Topics: Regular expressions matching topics to read
  • Consumer Groups: Kafka consumer group ID for offset management
  • Security: SSL or plaintext protocols
  • Filters: Schema and table filtering rules

Creating a Data Source

To create a new data source, use the create data source command:
create data source sensor type kafka options (
    brokers 'kafka:29092',
    topics '^metadb_sensor_1\.',
    consumer_group 'metadb_sensor_1_1',
    add_schema_prefix 'sensor_',
    table_stop_filter '^testing\.air_temp$,^testing\.air_temp_avg$'
);
1. Define the source name and type
   Choose a unique name for your data source and specify the type (currently kafka).

2. Configure connection options
   Provide broker addresses, topics, consumer group, and security settings.

3. Set up filtering rules
   Use schema_pass_filter, schema_stop_filter, and table_stop_filter to control which tables are synchronized.

4. Wait for synchronization
   The source starts in synchronizing mode. After the initial snapshot completes, run metadb endsync to finish setup.

Data Origin Tracking

The __origin column in Metadb tables allows you to track where data came from. This is especially useful when combining data from multiple sources into a single table.
select __id, __start, __origin, id, groupname, description 
    from library.patrongroup;
 __id | __start                | __origin | id | groupname | description
------+------------------------+----------+----+-----------+-----------------------
    8 | 2022-04-18 19:27:18-00 | west     | 15 | undergrad | Undergraduate Student
    4 | 2022-04-17 17:42:25-00 | east     | 10 | graduate  | Graduate Student
Origins allow grouping data independently of data sources. While data sources may be dictated by how data is collected (e.g., geographically in a sensor network), origins provide logical grouping based on your application needs.
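As a sketch of how __origin supports logical grouping, the following uses SQLite to stand in for the Metadb/PostgreSQL database (the table and rows are hypothetical, modeled on the example above):

```python
import sqlite3

# Stand-in for a Metadb table that combines records from two origins.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE patrongroup (
        __id INTEGER, __origin TEXT, id INTEGER,
        groupname TEXT, description TEXT
    )
""")
conn.executemany(
    "INSERT INTO patrongroup VALUES (?, ?, ?, ?, ?)",
    [
        (8, "west", 15, "undergrad", "Undergraduate Student"),
        (4, "east", 10, "graduate", "Graduate Student"),
        (9, "west", 16, "staff", "Library Staff"),
    ],
)

# Group records by origin, independently of which data source produced them.
rows = conn.execute(
    "SELECT __origin, COUNT(*) FROM patrongroup "
    "GROUP BY __origin ORDER BY __origin"
).fetchall()
print(rows)  # [('east', 1), ('west', 2)]
```

The same GROUP BY pattern applies directly in Metadb, since its tables are ordinary PostgreSQL tables with __origin as a regular column.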

Managing Data Sources

Modifying a Data Source

Change data source settings using alter data source:
alter data source sensor options (
    set consumer_group 'metadb_sensor_1'
);
Changes to data sources currently require restarting the Metadb server to take effect.

Removing a Data Source

Remove a data source configuration:
drop data source sensor;

Configuration Options

Schema and Table Filtering

Control which schemas and tables are synchronized:
create data source myapp type kafka options (
    brokers 'kafka:9092',
    topics '^myapp\.',
    consumer_group 'metadb_myapp',
    schema_pass_filter '^production\.',
    schema_stop_filter '^test\.',
    table_stop_filter '^temp_.*$'
);
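Assuming the filters are regular expressions matched against schema and table names (a reading of the options above, not a statement of Metadb internals), the pass/stop logic can be sketched as:

```python
import re

def schema_allowed(schema, pass_filters, stop_filters):
    """A schema is synchronized if it matches a pass filter (when any
    are configured) and does not match any stop filter."""
    if pass_filters and not any(re.search(p, schema) for p in pass_filters):
        return False
    return not any(re.search(s, schema) for s in stop_filters)

pass_filters = [r"^production\."]
stop_filters = [r"^test\."]

print(schema_allowed("production.orders", pass_filters, stop_filters))  # True
print(schema_allowed("test.orders", pass_filters, stop_filters))        # False
print(schema_allowed("staging.orders", pass_filters, stop_filters))     # False
```

Note that a pass filter makes synchronization opt-in: anything not matching it (such as staging.orders here) is excluded even without a stop filter.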

Schema Name Mapping

Modify schema names during synchronization:
  • trim_schema_prefix: Remove a prefix from schema names
  • add_schema_prefix: Add a prefix to schema names
  • map_public_schema: Map tables from the public schema to a different target schema
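A minimal sketch of the mapping behavior described above (the function, its argument names, and the order in which the options are applied are illustrative assumptions, not Metadb's implementation):

```python
def map_schema(name, trim_prefix=None, add_prefix=None, public_target=None):
    """Apply map_public_schema, trim_schema_prefix, and add_schema_prefix
    to a source schema name (an assumed ordering for illustration)."""
    if public_target is not None and name == "public":
        return public_target
    if trim_prefix and name.startswith(trim_prefix):
        name = name[len(trim_prefix):]
    if add_prefix:
        name = add_prefix + name
    return name

print(map_schema("metadb_testing", trim_prefix="metadb_", add_prefix="sensor_"))
# sensor_testing
print(map_schema("public", public_target="app"))
# app
```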

Monitoring Data Sources

Check the status of your data sources:
list data_sources;
View when tables were last updated:
select * from metadb.table_update 
    order by schema_name, table_name;
Check the system log for data source events:
select * from mdblog();

Best Practices

Use Consumer Groups Wisely

Each Metadb instance should use a unique consumer group ID to maintain independent offset tracking.

Filter Unnecessary Tables

Use table and schema filters to reduce unnecessary data synchronization and improve performance.

Plan for Initial Snapshots

Initial snapshots can take significant time. Monitor the logs and wait for the “snapshot complete” message before running endsync.

Track Origins for Multi-Source Setups

When combining data from multiple sources, use the __origin column to distinguish the source of each record.
