The DuckDB accelerator provides high-performance analytical query acceleration using an embedded DuckDB database. It supports both memory and file modes; file mode makes it suitable for datasets that exceed available RAM.

When to Use DuckDB

  • Analytical workloads: Complex aggregations, window functions, joins
  • Larger datasets: Use file mode for data that doesn’t fit in memory
  • Advanced SQL: Queries depend on DuckDB-specific SQL features
  • Single-file storage: Dataset can fit in one DuckDB file (<100GB typical)

Configuration

Memory Mode

Data stored in RAM (volatile):
datasets:
  - name: analytics
    from: s3://data-lake/events/
    acceleration:
      enabled: true
      engine: duckdb
      mode: memory

File Mode

Data persisted to disk:
acceleration:
  enabled: true
  engine: duckdb
  mode: file
  params:
    duckdb_file: /data/my_dataset.duckdb

Default File Location

If duckdb_file is not specified, the accelerator defaults to {spice_data_dir}/accelerated_duckdb.db:
acceleration:
  enabled: true
  engine: duckdb
  mode: file

File Create Mode

Deletes existing file on startup:
acceleration:
  enabled: true
  engine: duckdb
  mode: file_create
  params:
    duckdb_file: /data/fresh_dataset.duckdb

Refresh Modes

Full Refresh

Replaces all data:
acceleration:
  enabled: true
  engine: duckdb
  refresh_mode: full
  refresh_interval: 1h

Append Mode

Appends only new data based on time column:
acceleration:
  enabled: true
  engine: duckdb
  refresh_mode: append
  refresh_interval: 5m
  time_column: created_at
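Conceptually, an append refresh only fetches rows whose time column value is newer than the latest value already accelerated. A minimal Python sketch of that incremental filter (the function name and high-water-mark handling are illustrative, not Spice internals):

```python
from typing import Optional

def append_refresh_filter(time_column: str, last_seen: Optional[str]) -> str:
    """Build the WHERE clause an append refresh would conceptually use.

    On the first refresh there is no high-water mark, so all rows load;
    afterwards only rows newer than the last accelerated value are fetched.
    """
    if last_seen is None:
        return ""  # initial refresh: full load
    return f"WHERE {time_column} > '{last_seen}'"
```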

Snapshots

Bootstrap from S3 snapshots for fast cold starts:
acceleration:
  enabled: true
  engine: duckdb
  mode: file
  snapshot:
    enabled: true
    source: s3://my-bucket/snapshots/my_dataset/
    refresh: true
Snapshots support file mode only. Downloading a snapshot is much faster than full data refresh.

Performance Features

Preserve Insertion Order

Maintains insertion order using internal tables:
params:
  preserve_insertion_order: true
Useful for time-series data but adds overhead.

On-Refresh Sorting

Sort data after each refresh for better query performance:
params:
  on_refresh_sort_columns: timestamp DESC, user_id ASC
Note: Recreates table, dropping indexes and constraints.

Recompute Statistics

Recompute statistics after writes for optimal query plans:
params:
  on_refresh_recompute_statistics: true

Index Configuration

params:
  index_scan_percentage: 0.6  # Use index if selectivity < 60%
  index_scan_max_count: 1000000  # Max rows for index scan
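The two thresholds combine: an index scan is chosen only when the filter is expected to match less than index_scan_percentage of the table and no more than index_scan_max_count rows. A hypothetical sketch of that decision (not the actual planner code):

```python
def use_index_scan(matching_rows: int, total_rows: int,
                   index_scan_percentage: float = 0.6,
                   index_scan_max_count: int = 1_000_000) -> bool:
    # Index scans pay off only for selective filters; unselective scans
    # fall back to DuckDB's fast columnar full table scan.
    if total_rows == 0:
        return False
    selectivity = matching_rows / total_rows
    return selectivity < index_scan_percentage and matching_rows <= index_scan_max_count
```

For example, on a 10M-row table a filter matching 500K rows (5% selectivity) would use the index, while one matching 7M rows (70%) would not; a filter matching 2M rows is selective enough (20%) but exceeds index_scan_max_count, so it also falls back to a full scan.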

Connection Pooling

Pool Size

Configure connection pool size:
params:
  connection_pool_size: 20  # Default: 10 or number of datasets
Increases concurrency for high-traffic deployments.

Memory Limit

Set per-connection memory limit:
params:
  memory_limit: 4GB

Federation

DuckDB accelerators can federate queries across multiple file-mode DuckDB databases. Spice automatically attaches other DuckDB databases.
datasets:
  - name: orders
    acceleration:
      engine: duckdb
      mode: file
      params:
        duckdb_file: /data/orders.duckdb
  
  - name: customers
    acceleration:
      engine: duckdb
      mode: file
      params:
        duckdb_file: /data/customers.duckdb
Query across both:
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

Retention SQL

Delete old data automatically on each refresh:
acceleration:
  enabled: true
  engine: duckdb
  retention_sql: |
    DELETE FROM my_dataset 
    WHERE timestamp < CURRENT_TIMESTAMP - INTERVAL '30 days'
Executed before commit in a transaction.
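Because retention_sql is just a SQL statement, it can be generated per dataset when templating spicepod configs. A small helper sketch (build_retention_sql is an illustrative name, not part of Spice):

```python
def build_retention_sql(table: str, time_column: str, days: int) -> str:
    # Mirrors the retention_sql shown above: drop rows older than the window.
    return (
        f"DELETE FROM {table} "
        f"WHERE {time_column} < CURRENT_TIMESTAMP - INTERVAL '{days} days'"
    )
```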

Primary Keys and Constraints

Primary Key

acceleration:
  enabled: true
  engine: duckdb
  primary_key: id

Composite Primary Key

primary_key:
  - customer_id
  - order_id

On Conflict Behavior

acceleration:
  enabled: true
  engine: duckdb
  primary_key: id
  on_conflict: upsert
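With on_conflict: upsert, a refreshed row that shares a primary key with an existing row replaces it instead of raising a constraint error. In conceptual terms (a toy dict-based illustration, not DuckDB's implementation):

```python
from typing import Any, Dict

def upsert(table: Dict[Any, dict], row: dict, primary_key: str = "id") -> None:
    # New key -> insert; existing key -> the incoming row wins.
    table[row[primary_key]] = row
```

For example, upserting {"id": 1, "name": "b"} after {"id": 1, "name": "a"} leaves a single row with name "b".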

File Extensions

The duckdb_file parameter accepts any of these file extensions:
  • .db
  • .ddb
  • .duckdb

Performance Characteristics

Query Performance

Operation        | Performance | Notes
Full table scan  | Excellent   | Columnar storage, SIMD
Aggregations     | Excellent   | Optimized group-by
Window functions | Excellent   | Native support
Joins            | Excellent   | Hash and merge joins
Point queries    | Good        | Better with indexes

Storage

Feature       | Details
Format        | Columnar (compressed)
Compression   | Automatic (zstd, dictionary, RLE)
Ratio         | Typically 5-10x compression
Max file size | ~100GB typical, larger possible
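The compression ratio and the practical single-file limit interact: at the typical 5-10x compression quoted above, roughly 500GB-1TB of raw data fits within the ~100GB comfort zone. A quick sizing sketch (the ratios are the typical range, not guarantees):

```python
def estimated_file_gb(raw_gb: float, ratio: float) -> float:
    # On-disk size after columnar compression at the given ratio.
    return raw_gb / ratio

# 500 GB of raw data:
low_compression  = estimated_file_gb(500, 5)   # 100 GB at 5x
high_compression = estimated_file_gb(500, 10)  # 50 GB at 10x
```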

Partitioning

The DuckDB accelerator supports table partitioning:
acceleration:
  enabled: true
  engine: duckdb
  partition_by:
    - year
    - month
  params:
    partition_mode: tables
Creates separate tables per partition for pruning.

Example Configurations

Time-Series Analytics

datasets:
  - name: events
    from: kafka://localhost:9092/events
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: append
      refresh_interval: 1m
      time_column: event_time
      params:
        duckdb_file: /data/events.duckdb
        on_refresh_sort_columns: event_time DESC
      retention_sql: |
        DELETE FROM events 
        WHERE event_time < CURRENT_TIMESTAMP - INTERVAL '7 days'

Large Dataset with Snapshot

datasets:
  - name: sales
    from: s3://data-lake/sales/
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      params:
        duckdb_file: /data/sales.duckdb
        memory_limit: 8GB
        connection_pool_size: 20
      snapshot:
        enabled: true
        source: s3://snapshots/sales/
        refresh: true
      refresh_interval: 1h

In-Memory Analytics

datasets:
  - name: metrics
    from: postgres://db/metrics
    acceleration:
      enabled: true
      engine: duckdb
      mode: memory
      refresh_interval: 5m

Monitoring

-- File size
SELECT 
  dataset_name,
  file_size_bytes / 1024 / 1024 as size_mb
FROM runtime.metrics
WHERE name = 'acceleration_file_size';

-- Row count
SELECT 
  dataset_name,
  value as row_count
FROM runtime.metrics
WHERE name = 'acceleration_rows';
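These queries can also be scripted against the runtime. A standard-library Python sketch, assuming the Spice HTTP SQL endpoint at its default address (the http://localhost:8090/v1/sql path and port are assumptions; verify them for your deployment):

```python
import json
import urllib.request

def metrics_sql(metric_name: str) -> str:
    # Same shape as the monitoring queries above.
    return (
        "SELECT dataset_name, value FROM runtime.metrics "
        f"WHERE name = '{metric_name}'"
    )

def run_sql(sql: str, endpoint: str = "http://localhost:8090/v1/sql"):
    # Endpoint and port are assumptions; adjust to your runtime configuration.
    req = urllib.request.Request(
        endpoint,
        data=sql.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```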

Parameters

Parameter                       | Type    | Description                                    | Default
duckdb_file                     | string  | Path to DuckDB file                            | auto
memory_limit                    | string  | Per-connection memory limit (e.g., “4GB”)      | -
connection_pool_size            | integer | Max connections in pool                        | 10
preserve_insertion_order        | boolean | Maintain row insertion order                   | false
on_refresh_sort_columns         | string  | Sort columns after refresh (e.g., “time DESC”) | -
on_refresh_recompute_statistics | boolean | Recompute stats after writes                   | false
index_scan_percentage           | float   | Selectivity threshold for index use            | 0.6
index_scan_max_count            | integer | Max rows for index scan                        | 1000000

Limitations

  • Single-file storage (partition for larger datasets)
  • File mode requires local disk
  • No built-in replication (use snapshots for backup)
