
Overview

Duckling supports three sync modes optimized for different use cases:
  1. Full Sync - Complete data refresh using Appender API (60K+ rows/sec)
  2. Incremental Sync - Watermark-based delta processing (10K rows/sec)
  3. CDC (Change Data Capture) - Real-time binlog streaming (sub-second latency)

Full Sync

Use Cases

  • Initial database setup
  • Schema changes requiring rebuild
  • Recovery from sync failures
  • Tables without timestamp columns

How It Works

1. Create staging table with same schema
2. Stream all records from MySQL → Append to staging
3. Atomic swap: DELETE + INSERT in transaction
4. CHECKPOINT to ensure durability
5. Update watermark for next incremental sync
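
The steps above (minus the streaming itself) can be sketched as the SQL sequence the service would issue. This is an illustrative sketch, not Duckling's actual implementation: table names are assumptions, and the real service runs these statements through its DuckDB client with the Appender filling the staging table between creation and the swap.

```typescript
// Illustrative sketch of the staging + atomic-swap sequence above.
function buildFullSyncSwap(table: string): string[] {
  const staging = `${table}_staging`;
  return [
    // 1. Staging table with the same schema and zero rows
    `CREATE OR REPLACE TABLE ${staging} AS SELECT * FROM ${table} LIMIT 0;`,
    // 2. (Appender streams all MySQL rows into the staging table here)
    // 3. Atomic swap: readers never observe a half-empty production table
    `BEGIN TRANSACTION;`,
    `DELETE FROM ${table};`,
    `INSERT INTO ${table} SELECT * FROM ${staging};`,
    `COMMIT;`,
    // 4. Flush the WAL so the swap survives a crash
    `CHECKPOINT;`,
    `DROP TABLE ${staging};`,
  ];
}
```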

API Endpoint

curl -X POST 'http://localhost:3001/api/sync/full?db=lms' \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"

Performance

  • Speed: 60,000+ rows/sec using the Appender API
  • Memory: <500 MB for datasets up to 60M records
  • Atomicity: ACID transactions prevent partial states
  • Crash-safe: staging tables protect production data

Example Output

{
  "totalTables": 45,
  "successfulTables": 45,
  "failedTables": 0,
  "totalRecords": 12500000,
  "totalDuration": 320000,
  "syncDetails": {
    "sequential": 45,
    "watermark": 0
  }
}
Full sync for a 60M record database completes in ~16 minutes (62,500 rows/sec average). The same operation with traditional row-by-row INSERT statements would take 1.6+ hours.

Incremental Sync

Use Cases

  • Regular scheduled updates (default: every 15 minutes)
  • Tables with updatedAt/createdAt columns
  • Efficient delta processing
  • Production steady-state operation

Watermark-Based Tracking

Incremental sync uses watermarks to track the last processed record:
CREATE TABLE appender_watermarks (
  table_name VARCHAR PRIMARY KEY,
  last_processed_id BIGINT,
  last_processed_timestamp TIMESTAMP,
  primary_key_column VARCHAR,
  timestamp_column VARCHAR,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Timestamp Detection Priority

The system automatically detects the best timestamp column:
  1. updatedAt / updated_at / modifiedAt / modified_at (highest priority)
    • Captures all record modifications
    • Best for transactional tables (User, Order, Product)
  2. createdAt / created_at (fallback)
    • For append-only tables (Event, Log, Audit)
    • Tracks new records only
  3. timestamp (final fallback)
    • Generic timestamp column
    • Legacy systems
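
The priority above can be sketched as a small helper. The function name and matching rules here are illustrative, not Duckling's actual implementation; column names are matched case-insensitively, and `null` means the table has no usable timestamp column (so full sync applies).

```typescript
// Illustrative sketch of the timestamp-column detection priority above.
const TIMESTAMP_PRIORITY: string[][] = [
  ["updatedat", "updated_at", "modifiedat", "modified_at"], // all modifications
  ["createdat", "created_at"],                              // append-only tables
  ["timestamp"],                                            // legacy fallback
];

function detectTimestampColumn(columns: string[]): string | null {
  // Map lowercased name → original casing so we return the real column name
  const byLower = new Map(columns.map((c) => [c.toLowerCase(), c]));
  for (const tier of TIMESTAMP_PRIORITY) {
    for (const candidate of tier) {
      const match = byLower.get(candidate);
      if (match) return match;
    }
  }
  return null;
}
```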

Query Pattern

-- Incremental query (note: uses >= not >)
SELECT * FROM TableName 
WHERE updatedAt >= '2025-10-30 12:00:00'
ORDER BY updatedAt ASC
The system uses >= (not >) to prevent data loss at timestamp boundaries. The last record is re-processed each sync but safely handled by INSERT OR REPLACE.

INSERT OR REPLACE Behavior

Incremental sync uses upsert for all records:
INSERT OR REPLACE INTO TableName (col1, col2, ...) VALUES (?, ?, ...)
Scenario                   | Behavior
---------------------------|---------------------
New record (PK not exists) | INSERT
Updated record (PK exists) | REPLACE (update)
Duplicate re-processing    | REPLACE (idempotent)
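
These semantics boil down to a keyed upsert. A minimal in-memory model (illustrative names and row shape; the real statement is DuckDB's INSERT OR REPLACE):

```typescript
// In-memory model of upsert semantics: rows are keyed by primary key,
// so re-applying the same batch is a no-op.
type Row = { id: number; name: string };

function upsertBatch(table: Map<number, Row>, rows: Row[]): void {
  for (const row of rows) {
    table.set(row.id, row); // INSERT if the PK is new, REPLACE otherwise
  }
}

const users = new Map<number, Row>();
upsertBatch(users, [{ id: 1, name: "Ada" }]);          // INSERT
upsertBatch(users, [{ id: 1, name: "Ada Lovelace" }]); // REPLACE (update)
upsertBatch(users, [{ id: 1, name: "Ada Lovelace" }]); // REPLACE (idempotent)
```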

API Endpoint

curl -X POST 'http://localhost:3001/api/sync/incremental?db=lms' \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"

Performance

  • Speed: 10,000 rows/sec using INSERT OR REPLACE
  • Efficiency: only changed records are processed
  • Idempotent: safe to re-run without duplicates
  • Fallback: auto-switches to full sync if no watermark exists

Example Output

{
  "totalTables": 45,
  "successfulTables": 45,
  "failedTables": 0,
  "totalRecords": 1250,
  "totalDuration": 8500,
  "syncDetails": {
    "sequential": 3,
    "watermark": 42
  }
}
Incremental sync processes only changed records. For a database with 1,000 updates/15min, sync completes in ~8 seconds vs 16 minutes for full sync.

CDC (Change Data Capture)

Use Cases

  • Real-time data replication (sub-second latency)
  • Event-driven architectures
  • Live dashboards and analytics
  • Minimal replication lag requirements

How It Works

CDC streams MySQL binlog events in real-time:
MySQL Binlog → @vlasky/zongji → Event Queue → DuckDB
     ↓              ↓              ↓           ↓
  WriteRows    Parse Event    Serialize   INSERT
  UpdateRows   + Sanitize     Processing  UPDATE
  DeleteRows   + Filter       (FIFO)      DELETE
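
The FIFO step above can be sketched as a queue drained strictly in arrival order, so DuckDB applies events in the same sequence MySQL committed them. This is a simplified illustration, not cdcService.ts itself; `applied` stands in for the real INSERT/UPDATE/DELETE handling.

```typescript
// Minimal sketch of serialized FIFO event processing: a later UPDATE can
// never be applied before the INSERT it depends on.
type BinlogEvent = {
  type: "WriteRows" | "UpdateRows" | "DeleteRows";
  table: string;
};

class EventQueue {
  private queue: BinlogEvent[] = [];
  readonly applied: BinlogEvent[] = []; // stand-in for writes to DuckDB

  push(event: BinlogEvent): void {
    this.queue.push(event);
    this.drain();
  }

  private drain(): void {
    while (this.queue.length > 0) {
      const event = this.queue.shift()!;
      this.applied.push(event); // real service routes by event type here
    }
  }
}
```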

Implementation (cdcService.ts)

import ZongJi from '@vlasky/zongji';

const zongji = new ZongJi({
  host: config.mysqlHost,
  port: config.mysqlPort,
  user: config.mysqlUser,
  password: config.mysqlPassword,
  ssl: { rejectUnauthorized: true }
});

zongji.on('binlog', async (event) => {
  const eventName = event.getTypeName();
  const tableName = event.tableMap[event.tableId].tableName;
  
  switch (eventName) {
    case 'WriteRows':
      await this.handleInsert(tableName, event.rows);
      break;
    case 'UpdateRows':
      await this.handleUpdate(tableName, event.rows);
      break;
    case 'DeleteRows':
      await this.handleDelete(tableName, event.rows);
      break;
  }
  
  // Save binlog position for resume
  await this.savePosition(event.binlogName, event.nextPosition);
});

Event Processing

INSERT (WriteRows)

await this.handleInsert(tableName, rows);
// → INSERT OR REPLACE INTO table (...) VALUES (...)

UPDATE (UpdateRows)

await this.handleUpdate(tableName, rows);
// → INSERT OR REPLACE INTO table (...) VALUES (row.after)

DELETE (DeleteRows)

await this.handleDelete(tableName, rows);
// → DELETE FROM table WHERE id = ?

Binlog Position Tracking

CDC tracks binlog position for crash recovery:
CREATE TABLE cdc_binlog_position (
  database_id VARCHAR PRIMARY KEY,
  filename VARCHAR NOT NULL,
  position BIGINT NOT NULL,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Resume on restart:
const lastPosition = await this.getLastPosition();
if (lastPosition) {
  zongji.start({
    filename: lastPosition.filename,
    position: lastPosition.position
  });
} else {
  zongji.start({ startAtEnd: true });
}

Backpressure & Queue Management

CDC uses an event queue with backpressure to prevent memory exhaustion:
// Pause binlog stream when queue is full
if (this.eventQueue.length >= MAX_QUEUE_SIZE) {
  this.isPaused = true;
  this.pauseBinlogStream();
}

// Resume when queue drains below 50%
if (this.isPaused && this.eventQueue.length < MAX_QUEUE_SIZE / 2) {
  this.isPaused = false;
  this.resumeBinlogStream();
}

API Endpoints

# Start CDC
curl -X POST 'http://localhost:3001/api/cdc/start?db=lms' \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"

# Stop CDC
curl -X POST 'http://localhost:3001/api/cdc/stop?db=lms' \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"

# Get CDC stats
curl 'http://localhost:3001/api/cdc/stats?db=lms' \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"

CDC Statistics

{
  "isRunning": true,
  "connectedAt": "2025-11-06T10:30:00Z",
  "lastEventAt": "2025-11-06T12:45:30Z",
  "eventsProcessed": 125000,
  "insertsProcessed": 45000,
  "updatesProcessed": 65000,
  "deletesProcessed": 15000,
  "errors": 0,
  "currentPosition": {
    "filename": "mysql-bin.000042",
    "position": 1234567890,
    "timestamp": "2025-11-06T12:45:30Z"
  },
  "queueSize": 12,
  "queueHighWaterMark": 450
}

Performance

  • Latency: sub-second replication lag
  • Throughput: handles 1,000+ writes/sec
  • Ordering: sequential processing preserves event order
  • Recovery: auto-resume from the last binlog position

CDC provides near real-time replication with typical latency of 100-500ms. Perfect for live dashboards and event-driven applications.

Sync Mode Comparison

Feature      | Full Sync     | Incremental Sync  | CDC
-------------|---------------|-------------------|----------------
Speed        | 60K+ rows/sec | 10K rows/sec      | Sub-second
Use Case     | Initial load  | Scheduled updates | Real-time
Memory       | <500 MB       | <100 MB           | <50 MB
Latency      | Minutes       | Seconds           | Milliseconds
Requirements | None          | Timestamp column  | Binlog enabled
Idempotent   | ✅ Yes        | ✅ Yes            | ✅ Yes
ACID         | ✅ Yes        | ✅ Yes            | ⚠️ Per-event

Sync Strategy Recommendations

Production Setup

1. Initial Load: Full Sync (one-time)
2. Steady State: Incremental Sync (every 15 minutes)
3. Real-Time (optional): CDC for critical tables

Configuration

# Enable incremental sync (default: true)
ENABLE_INCREMENTAL_SYNC=true

# Auto-start sync on boot (default: true)
AUTO_START_SYNC=true

# Sync interval in minutes (default: 15)
SYNC_INTERVAL_MINUTES=15

# Batch size for streaming (default: 1000)
BATCH_SIZE=1000
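
These variables could be read with their documented defaults as sketched below. `loadSyncConfig` and its helpers are hypothetical, not Duckling's actual config loader; in a Node app you would pass `process.env`.

```typescript
// Hypothetical loader for the environment variables above, with the
// documented defaults applied when a variable is unset or malformed.
type Env = Record<string, string | undefined>;

function envInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

function envBool(env: Env, name: string, fallback: boolean): boolean {
  const raw = env[name];
  return raw === undefined ? fallback : raw === "true";
}

function loadSyncConfig(env: Env) {
  return {
    enableIncrementalSync: envBool(env, "ENABLE_INCREMENTAL_SYNC", true),
    autoStartSync: envBool(env, "AUTO_START_SYNC", true),
    syncIntervalMinutes: envInt(env, "SYNC_INTERVAL_MINUTES", 15),
    batchSize: envInt(env, "BATCH_SIZE", 1000),
  };
}
```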

Watermark Management Best Practices

Checkpoint Frequency

Incremental sync checkpoints the watermark every 10 batches, so a crash loses at most 10 batches of progress:
const WATERMARK_CHECKPOINT_INTERVAL = 10;
// batchesSinceCheckpoint is incremented once per streamed batch
if (batchesSinceCheckpoint >= WATERMARK_CHECKPOINT_INTERVAL) {
  await this.updateWatermark(tableName, watermark);
  batchesSinceCheckpoint = 0;
}

Boundary Handling

Critical: Use >= (not >) in incremental queries to prevent data loss at timestamp boundaries. INSERT OR REPLACE handles duplicate re-processing idempotently.
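
A two-row example makes the boundary case concrete (data and row shape are illustrative; string comparison suffices because the timestamps share one sortable format):

```typescript
// Two rows share the watermark timestamp, but only row 1 was visible when
// the previous sync saved its watermark. A strict `>` filter skips row 2
// forever; `>=` re-reads both, and the upsert makes the repeat of row 1
// harmless.
type Rec = { id: number; updatedAt: string };

const rows: Rec[] = [
  { id: 1, updatedAt: "2025-10-30 12:00:00" }, // processed by previous sync
  { id: 2, updatedAt: "2025-10-30 12:00:00" }, // committed just afterwards
];
const watermark = "2025-10-30 12:00:00";

const withGt = rows.filter((r) => r.updatedAt > watermark);   // misses both rows
const withGte = rows.filter((r) => r.updatedAt >= watermark); // reads both rows
```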

Fallback Strategy

If the watermark is invalid or missing, the system falls back to a full table sync:
if (!watermark || !watermark.timestampColumn) {
  logger.warn(`No valid watermark, falling back to full sync`);
  return await this.syncTableSequentialWithAppender(tableName);
}

Monitoring Sync Operations

Sync Logs

Every sync operation is logged:
SELECT 
  table_name,
  sync_type,
  records_processed,
  duration_ms,
  status,
  created_at
FROM sync_log
WHERE status = 'success'
ORDER BY created_at DESC
LIMIT 10;

API Endpoint

curl 'http://localhost:3001/api/sync-logs?limit=20&status=success' \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"

Next Steps

  • Architecture: understand the complete system architecture
  • Sequential Appender: deep dive into 60K+ rows/sec performance
  • Multi-Database: managing multiple isolated replicas
