The DuckDB accelerator provides high-performance analytical query acceleration using an embedded DuckDB database. It supports both memory and file modes; file mode makes it suitable for datasets that exceed available RAM.

When to Use DuckDB

  • Analytical workloads: Complex aggregations, window functions, joins
  • Larger datasets: Use file mode for data that doesn’t fit in memory
  • Advanced SQL: Queries depend on DuckDB-specific SQL features
  • Single-file storage: Dataset can fit in one DuckDB file (<100GB typical)

Configuration

Memory Mode

Data stored in RAM (volatile):
datasets:
  - name: analytics
    from: s3://data-lake/events/
    acceleration:
      enabled: true
      engine: duckdb
      mode: memory

File Mode

Data persisted to disk:
acceleration:
  enabled: true
  engine: duckdb
  mode: file
  params:
    duckdb_file: /data/my_dataset.duckdb

Default File Location

If duckdb_file is not specified, the accelerator defaults to {spice_data_dir}/accelerated_duckdb.db:
acceleration:
  enabled: true
  engine: duckdb
  mode: file

File Create Mode

Deletes existing file on startup:
acceleration:
  enabled: true
  engine: duckdb
  mode: file_create
  params:
    duckdb_file: /data/fresh_dataset.duckdb

Refresh Modes

Full Refresh

Replaces all data:
acceleration:
  enabled: true
  engine: duckdb
  refresh_mode: full
  refresh_interval: 1h

Append Mode

Appends only new data based on time column:
acceleration:
  enabled: true
  engine: duckdb
  refresh_mode: append
  refresh_interval: 5m
  time_column: created_at
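Conceptually, an append refresh only fetches rows whose time column value is newer than the latest value already accelerated. A minimal Python sketch of that incremental filter (the function name and high-water-mark handling are illustrative, not Spice internals):

```python
from typing import Optional

def append_refresh_filter(time_column: str, last_seen: Optional[str]) -> str:
    """Build the WHERE clause an append refresh would conceptually use.

    On the first refresh there is no high-water mark, so all rows load;
    afterwards only rows newer than the last accelerated value are fetched.
    """
    if last_seen is None:
        return ""  # initial refresh: full load
    return f"WHERE {time_column} > '{last_seen}'"
```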

Snapshots

Bootstrap from S3 snapshots for fast cold starts:
acceleration:
  enabled: true
  engine: duckdb
  mode: file
  snapshot:
    enabled: true
    source: s3://my-bucket/snapshots/my_dataset/
    refresh: true
Snapshots support file mode only. Downloading a snapshot is much faster than full data refresh.

Performance Features

Preserve Insertion Order

Maintains insertion order using internal tables:
params:
  preserve_insertion_order: true
Useful for time-series data but adds overhead.

On-Refresh Sorting

Sort data after each refresh for better query performance:
params:
  on_refresh_sort_columns: timestamp DESC, user_id ASC
Note: Recreates table, dropping indexes and constraints.

Recompute Statistics

Recompute statistics after writes for optimal query plans:
params:
  on_refresh_recompute_statistics: true

Index Configuration

params:
  index_scan_percentage: 0.6  # Use index if selectivity < 60%
  index_scan_max_count: 1000000  # Max rows for index scan
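The two thresholds combine: an index scan is chosen only when the filter is expected to match less than index_scan_percentage of the table and no more than index_scan_max_count rows. A hypothetical sketch of that decision (not the actual planner code):

```python
def use_index_scan(matching_rows: int, total_rows: int,
                   index_scan_percentage: float = 0.6,
                   index_scan_max_count: int = 1_000_000) -> bool:
    # Index scans pay off only for selective filters; unselective scans
    # fall back to DuckDB's fast columnar full table scan.
    if total_rows == 0:
        return False
    selectivity = matching_rows / total_rows
    return selectivity < index_scan_percentage and matching_rows <= index_scan_max_count
```

For example, on a 10M-row table a filter matching 500K rows (5% selectivity) would use the index, while one matching 7M rows (70%) would not; a filter matching 2M rows is selective enough (20%) but exceeds index_scan_max_count, so it also falls back to a full scan.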

Connection Pooling

Pool Size

Configure connection pool size:
params:
  connection_pool_size: 20  # Default: 10 or number of datasets
Increases concurrency for high-traffic deployments.

Memory Limit

Set per-connection memory limit:
params:
  memory_limit: 4GB

Federation

DuckDB accelerators can federate queries across multiple file-mode DuckDB databases. Spice automatically attaches other DuckDB databases.
datasets:
  - name: orders
    acceleration:
      engine: duckdb
      mode: file
      params:
        duckdb_file: /data/orders.duckdb
  
  - name: customers
    acceleration:
      engine: duckdb
      mode: file
      params:
        duckdb_file: /data/customers.duckdb
Query across both:
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

Retention SQL

Delete old data automatically on each refresh:
acceleration:
  enabled: true
  engine: duckdb
  retention_sql: |
    DELETE FROM my_dataset 
    WHERE timestamp < CURRENT_TIMESTAMP - INTERVAL '30 days'
Executed before commit in a transaction.
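Because retention_sql is just a SQL statement, it can be generated per dataset when templating spicepod configs. A small helper sketch (build_retention_sql is an illustrative name, not part of Spice):

```python
def build_retention_sql(table: str, time_column: str, days: int) -> str:
    # Mirrors the retention_sql shown above: drop rows older than the window.
    return (
        f"DELETE FROM {table} "
        f"WHERE {time_column} < CURRENT_TIMESTAMP - INTERVAL '{days} days'"
    )
```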

Primary Keys and Constraints

Primary Key

acceleration:
  enabled: true
  engine: duckdb
  primary_key: id

Composite Primary Key

primary_key:
  - customer_id
  - order_id

On Conflict Behavior

acceleration:
  enabled: true
  engine: duckdb
  primary_key: id
  on_conflict: upsert
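With on_conflict: upsert, a refreshed row that shares a primary key with an existing row replaces it instead of raising a constraint error. In conceptual terms (a toy dict-based illustration, not DuckDB's implementation):

```python
from typing import Any, Dict

def upsert(table: Dict[Any, dict], row: dict, primary_key: str = "id") -> None:
    # New key -> insert; existing key -> the incoming row wins.
    table[row[primary_key]] = row
```

For example, upserting {"id": 1, "name": "b"} after {"id": 1, "name": "a"} leaves a single row with name "b".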

File Extensions

The duckdb_file parameter accepts any of these file extensions:
  • .db
  • .ddb
  • .duckdb

Performance Characteristics

Query Performance

Operation        | Performance | Notes
Full table scan  | Excellent   | Columnar storage, SIMD
Aggregations     | Excellent   | Optimized group-by
Window functions | Excellent   | Native support
Joins            | Excellent   | Hash and merge joins
Point queries    | Good        | Better with indexes

Storage

Feature       | Details
Format        | Columnar (compressed)
Compression   | Automatic (zstd, dictionary, RLE)
Ratio         | Typically 5-10x compression
Max file size | ~100GB typical, larger possible
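The compression ratio and the practical single-file limit interact: at the typical 5-10x compression quoted above, roughly 500GB-1TB of raw data fits within the ~100GB comfort zone. A quick sizing sketch (the ratios are the typical range, not guarantees):

```python
def estimated_file_gb(raw_gb: float, ratio: float) -> float:
    # On-disk size after columnar compression at the given ratio.
    return raw_gb / ratio

# 500 GB of raw data:
low_compression  = estimated_file_gb(500, 5)   # 100 GB at 5x
high_compression = estimated_file_gb(500, 10)  # 50 GB at 10x
```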

Partitioning

The DuckDB accelerator supports table partitioning:
acceleration:
  enabled: true
  engine: duckdb
  partition_by:
    - year
    - month
  params:
    partition_mode: tables
Creates separate tables per partition for pruning.

Example Configurations

Time-Series Analytics

datasets:
  - name: events
    from: kafka://localhost:9092/events
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: append
      refresh_interval: 1m
      time_column: event_time
      params:
        duckdb_file: /data/events.duckdb
        on_refresh_sort_columns: event_time DESC
      retention_sql: |
        DELETE FROM events 
        WHERE event_time < CURRENT_TIMESTAMP - INTERVAL '7 days'

Large Dataset with Snapshot

datasets:
  - name: sales
    from: s3://data-lake/sales/
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      params:
        duckdb_file: /data/sales.duckdb
        memory_limit: 8GB
        connection_pool_size: 20
      snapshot:
        enabled: true
        source: s3://snapshots/sales/
        refresh: true
      refresh_interval: 1h

In-Memory Analytics

datasets:
  - name: metrics
    from: postgres://db/metrics
    acceleration:
      enabled: true
      engine: duckdb
      mode: memory
      refresh_interval: 5m

Monitoring

-- File size
SELECT 
  dataset_name,
  file_size_bytes / 1024 / 1024 as size_mb
FROM runtime.metrics
WHERE name = 'acceleration_file_size';

-- Row count
SELECT 
  dataset_name,
  value as row_count
FROM runtime.metrics
WHERE name = 'acceleration_rows';
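These queries can also be scripted against the runtime. A standard-library Python sketch, assuming the Spice HTTP SQL endpoint at its default address (the http://localhost:8090/v1/sql path and port are assumptions; verify them for your deployment):

```python
import json
import urllib.request

def metrics_sql(metric_name: str) -> str:
    # Same shape as the monitoring queries above.
    return (
        "SELECT dataset_name, value FROM runtime.metrics "
        f"WHERE name = '{metric_name}'"
    )

def run_sql(sql: str, endpoint: str = "http://localhost:8090/v1/sql"):
    # Endpoint and port are assumptions; adjust to your runtime configuration.
    req = urllib.request.Request(
        endpoint,
        data=sql.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```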

Parameters

Parameter                       | Type    | Description                                    | Default
duckdb_file                     | string  | Path to DuckDB file                            | auto
memory_limit                    | string  | Per-connection memory limit (e.g., “4GB”)      | -
connection_pool_size            | integer | Max connections in pool                        | 10
preserve_insertion_order        | boolean | Maintain row insertion order                   | false
on_refresh_sort_columns         | string  | Sort columns after refresh (e.g., “time DESC”) | -
on_refresh_recompute_statistics | boolean | Recompute stats after writes                   | false
index_scan_percentage           | float   | Selectivity threshold for index use            | 0.6
index_scan_max_count            | integer | Max rows for index scan                        | 1000000

Limitations

  • Single-file storage (partition for larger datasets)
  • File mode requires local disk
  • No built-in replication (use snapshots for backup)
