
Overview

Duckling is designed for high performance out of the box, but you can tune its configuration for your specific workload. This guide covers all performance-related settings.

Key Performance Factors

Sync Performance

Optimize data replication throughput from MySQL to DuckDB

Query Performance

Maximize analytical query speed on DuckDB

Resource Utilization

Balance CPU, memory, and I/O usage

Concurrency

Handle multiple simultaneous operations

Sync Performance Tuning

Batch Size Configuration

BATCH_SIZE controls how many records are fetched from MySQL per batch:
BATCH_SIZE=1000  # Default
Impact:
  • Higher values (2000-5000): Faster sync, higher memory usage
  • Lower values (500-1000): Slower sync, lower memory usage
Recommendations:
  • Small tables (<100K rows): BATCH_SIZE=500
  • Medium tables (100K-10M rows): BATCH_SIZE=1000 (default)
  • Large tables (>10M rows): BATCH_SIZE=2000-5000
  • Memory constrained: BATCH_SIZE=500
Batch sizes above 5000 may hit MySQL max_allowed_packet limits or cause network timeouts.
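As a rough sanity check, the per-batch payload is BATCH_SIZE times the average row size, and it should stay well under MySQL's max_allowed_packet (commonly 64 MB). A sketch, assuming a hypothetical 2 KB average row:

```shell
# Rough per-batch payload check. avg_row_bytes is an assumed figure;
# measure yours via AVG_ROW_LENGTH in information_schema.tables.
batch_size=5000
avg_row_bytes=2048
batch_mb=$(( batch_size * avg_row_bytes / 1024 / 1024 ))
echo "approx ${batch_mb} MB per batch"   # well under a 64 MB max_allowed_packet
```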

Insert Batch Size

INSERT_BATCH_SIZE controls how many records are inserted into DuckDB per batch:
INSERT_BATCH_SIZE=2000  # Default
Impact:
  • Higher values (3000-5000): Faster inserts, higher memory usage
  • Lower values (1000-2000): Slower inserts, lower memory usage
Recommendations:
  • Default: INSERT_BATCH_SIZE=2000
  • High throughput: INSERT_BATCH_SIZE=5000
  • Memory constrained: INSERT_BATCH_SIZE=1000
Duckling uses the DuckDB Appender API, which sustains 60,000+ rows/sec, roughly 6x faster than row-by-row INSERT statements.
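At that rate, full-sync time is roughly row count divided by throughput. A back-of-the-envelope estimate for a hypothetical 10M-row table:

```shell
# Estimated insert time for a 10M-row table at the Appender rate above
rows=10000000
rate=60000   # rows/sec via the Appender API
echo "$(( rows / rate )) seconds"   # ~166 seconds, i.e. under 3 minutes
```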

Appender Flush Interval

APPENDER_FLUSH_INTERVAL controls how often the Appender flushes to disk (milliseconds):
APPENDER_FLUSH_INTERVAL=5000  # Default (5 seconds)
Impact:
  • Higher values (10000-30000): Better throughput, higher memory usage
  • Lower values (1000-5000): More frequent disk writes, lower memory usage
Recommendations:
  • Default: APPENDER_FLUSH_INTERVAL=5000
  • High throughput: APPENDER_FLUSH_INTERVAL=10000
  • Low latency: APPENDER_FLUSH_INTERVAL=1000

Sync Interval

SYNC_INTERVAL_MINUTES controls how often automatic sync runs:
SYNC_INTERVAL_MINUTES=15  # Default
Impact:
  • Lower values (5-10): Near real-time sync, higher load
  • Higher values (30-60): Lower load, less fresh data
Recommendations:
  • Real-time analytics: SYNC_INTERVAL_MINUTES=5
  • Standard analytics: SYNC_INTERVAL_MINUTES=15 (default)
  • Batch analytics: SYNC_INTERVAL_MINUTES=60
For near real-time sync (<5 minutes), consider enabling CDC (Change Data Capture) instead of polling-based sync.

Connection Pool Tuning

MySQL Connections

MYSQL_MAX_CONNECTIONS sets the MySQL connection pool size:
MYSQL_MAX_CONNECTIONS=5  # Default
Impact:
  • Higher values (10-20): More parallel table fetches, higher MySQL load
  • Lower values (3-5): Less MySQL load, slower sync
Recommendations:
  • Small databases (<20 tables): MYSQL_MAX_CONNECTIONS=3
  • Medium databases (20-100 tables): MYSQL_MAX_CONNECTIONS=5 (default)
  • Large databases (>100 tables): MYSQL_MAX_CONNECTIONS=10
  • Shared MySQL: Keep low (3-5) to avoid overwhelming source
Increasing MySQL connections puts more load on the source database. Monitor MySQL performance when tuning.

DuckDB Connections

DUCKDB_MAX_CONNECTIONS sets the DuckDB connection pool size:
DUCKDB_MAX_CONNECTIONS=10  # Default
Impact:
  • Higher values (20-50): More concurrent queries, higher memory usage
  • Lower values (5-10): Less memory usage, query queuing
Recommendations:
  • Low query volume: DUCKDB_MAX_CONNECTIONS=10 (default)
  • High query volume: DUCKDB_MAX_CONNECTIONS=20-30
  • Memory constrained: DUCKDB_MAX_CONNECTIONS=5

Worker Thread Configuration

WORKER_THREADS controls parallel processing threads:
WORKER_THREADS=0  # Default (auto: CPU cores - 1, min 1)
Options:
  • 0: Auto-detect (recommended)
  • 1-N: Explicit thread count
Impact:
  • More threads: Better parallelism for large table syncs
  • Fewer threads: Lower CPU usage
Recommendations:
  • Default: WORKER_THREADS=0 (auto)
  • CPU constrained: WORKER_THREADS=2
  • High throughput: WORKER_THREADS=8 (match CPU cores)
DuckDB automatically parallelizes queries internally. This setting controls Duckling’s sync operation parallelism.
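The auto setting (WORKER_THREADS=0) resolves to CPU cores minus one, with a floor of one. Sketched in shell:

```shell
# Reproduce the auto-detect rule: cores - 1, minimum 1
cores=$(nproc 2>/dev/null || sysctl -n hw.ncpu)
threads=$(( cores > 1 ? cores - 1 : 1 ))
echo "WORKER_THREADS would resolve to ${threads}"
```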

Query Performance Tuning

Query Governor

MAX_CONCURRENT_QUERIES limits simultaneous query execution:
MAX_CONCURRENT_QUERIES=10  # Default
Impact:
  • Higher values: More concurrent queries, higher resource usage
  • Lower values: Better per-query performance, query queuing
Recommendations:
  • Default: MAX_CONCURRENT_QUERIES=10
  • High concurrency: MAX_CONCURRENT_QUERIES=20
  • Resource constrained: MAX_CONCURRENT_QUERIES=5

Query Timeout

QUERY_TIMEOUT_MS sets maximum query execution time:
QUERY_TIMEOUT_MS=30000  # Default (30 seconds)
Impact:
  • Higher values: Allow long-running queries
  • Lower values: Prevent runaway queries
Recommendations:
  • Interactive queries: QUERY_TIMEOUT_MS=30000 (default)
  • Analytical queries: QUERY_TIMEOUT_MS=300000 (5 minutes)
  • Report generation: QUERY_TIMEOUT_MS=600000 (10 minutes)

Query Queue

QUERY_QUEUE_MAX sets maximum queued queries:
QUERY_QUEUE_MAX=50  # Default
Impact:
  • Higher values: More buffering, higher memory usage
  • Lower values: Faster query rejection under load
Recommendations:
  • Default: QUERY_QUEUE_MAX=50
  • High load: QUERY_QUEUE_MAX=100
  • Fail fast: QUERY_QUEUE_MAX=20
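Together with MAX_CONCURRENT_QUERIES, the queue bounds total in-flight requests: anything beyond running plus queued is rejected immediately.

```shell
# Total requests the server holds before rejecting (running + queued), at defaults
max_concurrent=10
queue_max=50
echo "capacity: $(( max_concurrent + queue_max )) requests"   # 60
```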

Memory Management

System Requirements

Minimum:
  • CPU: 2 cores
  • RAM: 2 GB
  • Disk: 2x database size
Recommended:
  • CPU: 4+ cores
  • RAM: 8+ GB
  • Disk: 3x database size (for backups)
High Performance:
  • CPU: 8+ cores
  • RAM: 16+ GB
  • Disk: SSD, 3x database size

Memory Usage Patterns

Operation          Memory Usage        Peak Memory
Sync (1000 batch)  ~50 MB per table    ~200 MB
Sync (5000 batch)  ~250 MB per table   ~1 GB
Query (simple)     ~10 MB              ~50 MB
Query (complex)    ~100 MB             ~500 MB
Backup             ~100 MB             ~500 MB
S3 Upload          Streaming           ~100 MB
DuckDB uses memory-mapped I/O for database files. Ensure sufficient free RAM for the OS to cache frequently accessed data.
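The sync figures above scale roughly linearly with batch size (~50 MB per 1000 rows in flight). A quick estimate for an intermediate batch size:

```shell
# Rough per-table sync memory, extrapolated from the table above
# (50 MB per 1000 batch rows; treat as an approximation, not a guarantee)
batch_size=2000
mb_per_1000=50
echo "~$(( batch_size * mb_per_1000 / 1000 )) MB per table"
```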

Monitoring Memory

Check system metrics:
curl http://localhost:3001/metrics?db=your-database-id \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"
Watch for:
  • rssMB > 80% of available RAM
  • heapUsedMB growing continuously (memory leak)
  • eventLoopLagMs > 100ms (overload)
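Those checks can be scripted with jq. A sketch against a canned payload (the JSON shape is assumed from the fields above; in practice, pipe the curl output from the metrics endpoint into the same filter):

```shell
# Flag overload conditions from a /metrics payload; the sample values and the
# 1600 MB rss threshold are hypothetical, tune them to your available RAM.
sample='{"system":{"current":{"rssMB":1800,"heapUsedMB":900,"eventLoopLagMs":120}}}'
echo "$sample" | jq -r '
  .system.current
  | (if .eventLoopLagMs > 100 then "event loop overloaded" else empty end),
    (if .rssMB > 1600 then "rss above threshold" else empty end)'
```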

I/O Optimization

Storage Type

Recommendations:
  • SSD: typically ~10x faster queries than HDD, highly recommended
  • NVMe: Best performance for large databases
  • HDD: Only for small databases or budget constraints

File System

Best:
  • ext4 (Linux)
  • XFS (Linux, large files)
  • APFS (macOS)
Avoid:
  • NFS (network latency)
  • SMB/CIFS (network latency)
  • FAT32 (file size limits)

Docker Volume Performance

Bind Mount (Recommended):
volumes:
  - ./data:/app/data  # Direct host filesystem access
Named Volume:
volumes:
  - duckdb-data:/app/data  # Docker managed, slightly slower
Avoid Docker volume drivers that use network storage (NFS, EBS, etc.) for the DuckDB database file. Use local SSD storage for best performance.

Network Optimization

MySQL Network

Latency Impact:
  • <1ms (same server): Excellent
  • 1-10ms (same datacenter): Good
  • 10-50ms (same region): Acceptable
  • >50ms (cross-region): Increase batch sizes
Optimizations:
  • Use private network when possible
  • Increase BATCH_SIZE for high latency
  • Size the InnoDB buffer pool on the source (the MySQL query cache was removed in MySQL 8.0)
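Batch size matters on slow links because each batch costs at least one round-trip. A rough floor on network overhead, assuming one round-trip per batch:

```shell
# Minimum round-trip overhead for a full sync of a hypothetical 10M-row table
rows=10000000
latency_ms=50   # cross-region link
for batch in 1000 5000; do
  trips=$(( rows / batch ))
  echo "BATCH_SIZE=${batch}: ~$(( trips * latency_ms / 1000 )) s of round-trip latency"
done
```

Going from BATCH_SIZE=1000 to 5000 cuts this overhead from ~500 s to ~100 s on a 50 ms link, which is why larger batches are recommended cross-region.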

API Performance

Query API:
  • Use LIMIT clauses for large result sets
  • Enable compression for large responses
  • Use pagination for table data
Connection:
  • Keep-alive connections
  • HTTP/2 for better multiplexing
  • CDN for frontend assets

Performance Monitoring

Key Metrics

Sync Performance:
curl "http://localhost:3001/api/sync-logs?limit=100&db=your-database-id" \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}" | \
  jq '.[] | {table: .table_name, records: .records_processed, duration: .duration_ms, rate: (.records_processed / (.duration_ms / 1000))}'
Target rates:
  • Appender API: 60,000+ rows/sec
  • Small batches: 10,000+ rows/sec
Query Performance:
curl http://localhost:3001/metrics?db=your-database-id \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}" | \
  jq '.queries.patterns | sort_by(.avgMs) | reverse | .[0:10]'
Target times:
  • Simple aggregations: <100ms
  • Complex joins: <1s
  • Report queries: <10s
System Health:
curl http://localhost:3001/metrics?db=your-database-id \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}" | \
  jq '.system.current'
Watch for:
  • cpuPercent < 80%
  • rssMB < 80% of available
  • eventLoopLagMs < 50ms

Configuration Profiles

High Throughput Profile

Goal: Maximum sync speed, high resource usage
# Sync
BATCH_SIZE=5000
INSERT_BATCH_SIZE=5000
APPENDER_FLUSH_INTERVAL=10000
SYNC_INTERVAL_MINUTES=5

# Connections
MYSQL_MAX_CONNECTIONS=10
DUCKDB_MAX_CONNECTIONS=30

# Workers
WORKER_THREADS=0  # Auto

# Query
MAX_CONCURRENT_QUERIES=20
QUERY_TIMEOUT_MS=300000

Balanced Profile (Default)

Goal: Good performance, moderate resource usage
# Sync
BATCH_SIZE=1000
INSERT_BATCH_SIZE=2000
APPENDER_FLUSH_INTERVAL=5000
SYNC_INTERVAL_MINUTES=15

# Connections
MYSQL_MAX_CONNECTIONS=5
DUCKDB_MAX_CONNECTIONS=10

# Workers
WORKER_THREADS=0  # Auto

# Query
MAX_CONCURRENT_QUERIES=10
QUERY_TIMEOUT_MS=30000

Low Resource Profile

Goal: Minimal resource usage, slower performance
# Sync
BATCH_SIZE=500
INSERT_BATCH_SIZE=1000
APPENDER_FLUSH_INTERVAL=5000
SYNC_INTERVAL_MINUTES=30

# Connections
MYSQL_MAX_CONNECTIONS=3
DUCKDB_MAX_CONNECTIONS=5

# Workers
WORKER_THREADS=2

# Query
MAX_CONCURRENT_QUERIES=5
QUERY_TIMEOUT_MS=30000

Benchmarking

Sync Benchmark

Test sync performance:
1. Clear Database

rm data/your-database.db
2. Run Full Sync

time curl -X POST "http://localhost:3001/api/sync/full?db=your-database-id" \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"
3. Check Sync Logs

curl "http://localhost:3001/api/sync-logs?limit=100&db=your-database-id" \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}"
Calculate rows/second:
total_records / total_seconds = rows_per_second
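For example, with hypothetical totals of 1,200,000 records synced in 20 seconds:

```shell
# Hypothetical benchmark result: 1.2M records in 20 s
total_records=1200000
total_seconds=20
echo "$(( total_records / total_seconds )) rows/sec"   # 60000, at the Appender target
```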

Query Benchmark

Test query performance:
# Simple aggregation
time curl -X POST http://localhost:3001/api/query?db=your-database-id \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT COUNT(*) FROM User"}'

# Complex join
time curl -X POST http://localhost:3001/api/query?db=your-database-id \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT u.*, COUNT(o.id) FROM User u LEFT JOIN Order o ON u.id = o.userId GROUP BY u.id"}'

Troubleshooting Performance

Slow Sync

1. Check Sync Logs

Identify slow tables:
curl "http://localhost:3001/api/sync-logs?db=your-database-id" \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}" | \
  jq 'sort_by(.duration_ms) | reverse | .[] | {table: .table_name, duration: .duration_ms}'
2. Check MySQL Performance

Test direct MySQL query speed:
docker exec duckling-server node scripts/mysql.js "SELECT COUNT(*) FROM SlowTable"
3. Increase Batch Size

For large tables:
BATCH_SIZE=2000
INSERT_BATCH_SIZE=3000
4. Add MySQL Index

Index timestamp columns used for watermarks:
CREATE INDEX idx_updated_at ON SlowTable(updatedAt);

Slow Queries

1. Check Query Patterns

curl http://localhost:3001/metrics?db=your-database-id \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}" | \
  jq '.queries.patterns | sort_by(.avgMs) | reverse'
2. Optimize SQL

  • Add WHERE clauses to filter early
  • Use LIMIT for large result sets
  • Avoid SELECT * when possible
  • Use indexes on join columns
3. Monitor Active Queries

curl http://localhost:3001/metrics?db=your-database-id \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}" | \
  jq '.queries.active'

High Memory Usage

1. Check System Metrics

curl http://localhost:3001/metrics?db=your-database-id \
  -H "Authorization: Bearer ${DUCKLING_API_KEY}" | \
  jq '.system.current'
2. Reduce Batch Sizes

BATCH_SIZE=500
INSERT_BATCH_SIZE=1000
3. Reduce Connections

DUCKDB_MAX_CONNECTIONS=5
MAX_CONCURRENT_QUERIES=5
4. Restart Server

If memory leak suspected:
docker-compose restart duckdb-server
