Pruning is an operation that deletes old entity versions from a deployment, reducing storage requirements and improving both query and indexing performance.

Overview

By default, Graph Node stores a complete version history for all entities, allowing time-travel queries to any historical block. While powerful, this creates significant storage overhead and can degrade performance. Pruning removes entity versions older than a specified block, making it impossible to query the deployment at blocks prior to the pruning point. However, queries at blocks after the pruning point are unaffected and perform better due to reduced data volume.

Benefits

  • Reduced storage: Dropping historical entity versions can shrink a deployment's data substantially, especially for subgraphs with frequently updated entities
  • Faster queries: Less data to scan improves query performance considerably
  • Better indexing speed: Smaller tables and indexes speed up block processing
  • Lower infrastructure costs: Decreased storage and compute requirements

Tradeoffs

Pruning is a user-visible operation with important limitations:
  • Limited time-travel queries: Can only query back to the earliest_block after pruning
  • Restricted grafting: Cannot graft further back than history_blocks before the current head
  • Irreversible: Pruned data cannot be recovered without reindexing from scratch

How Pruning Works

History Retention

Pruning is controlled by the history_blocks parameter, which specifies how many blocks of history to retain. The actual pruning block b is calculated as:
b = latest_block - history_blocks
After pruning, the earliest_block for the deployment is updated, and Graph Node returns an error for any query attempting to time-travel before this block.
The value of history_blocks must be greater than ETHEREUM_REORG_THRESHOLD to ensure that chain reorganizations never conflict with pruning operations.
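As a quick sanity check, the pruning block can be computed with shell arithmetic. The block numbers below are illustrative, not taken from a real deployment:

```shell
# Illustrative values; substitute your deployment's actual head block
# and your chosen history_blocks setting.
latest_block=20000000
history_blocks=10000

# Entity versions older than the pruning block b are removed
prune_block=$((latest_block - history_blocks))
echo "earliest queryable block after pruning: $prune_block"
```

With these values, time-travel queries below block 19990000 would fail after pruning.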

Accessing Earliest Block

You can retrieve the earliest_block through the Index Node status API:
curl http://localhost:8030/graphql \
  -H 'Content-Type: application/json' \
  -d '{"query": "{ indexingStatusForCurrentVersion(subgraphName: \"author/subgraph\") { chains { earliestBlock { number } } } }"}'
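To use the earliest block programmatically, the number can be extracted from the JSON response. A minimal sketch using only POSIX sed, with a canned sample response standing in for a live API call (the exact field layout of the real response may differ, so a JSON-aware tool like jq is more robust):

```shell
# Canned sample shaped like the status API response; in practice,
# pipe the output of the curl command above into the sed filter.
response='{"data":{"indexingStatusForCurrentVersion":{"chains":[{"earliestBlock":{"number":"1000000"}}]}}}'

# Pull out the earliestBlock number; this pattern assumes the
# sample shape above with a single "number" field.
echo "$response" | sed -n 's/.*"number":"\([0-9]*\)".*/\1/p'
```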

Using Graphman Prune

Initial Pruning

Start pruning a deployment with:
graphman --config config.toml prune <deployment> --history <blocks>
Arguments:
  • <deployment> - Deployment identifier (name, IPFS hash, or namespace)
  • --history <blocks> - Number of blocks of history to retain
Example:
graphman --config config.toml prune author/subgraph --history 10000
This command:
  1. Performs an initial prune of the deployment
  2. Sets the deployment’s history_blocks configuration
  3. Enables automatic repruning when more history accumulates

One-Time Pruning

To prune once without enabling automatic repruning:
graphman --config config.toml prune <deployment> --history <blocks> --once
This performs a single prune operation and does not set up ongoing maintenance.

Disabling Automatic Pruning

To stop automatic repruning after initial setup:
graphman --config config.toml prune <deployment> --history 999999999
Setting history_blocks to a very large value effectively disables repruning.

Automatic Repruning

How It Works

After initial pruning, Graph Node automatically reprunes the deployment when it accumulates more than:
history_blocks * GRAPH_STORE_HISTORY_SLACK_FACTOR
blocks of history. Example calculation:
  • GRAPH_STORE_HISTORY_SLACK_FACTOR=1.5
  • history_blocks=10000
  • Reprune triggers when history exceeds: 10000 * 1.5 = 15000 blocks

Reprune Frequency

Repruning occurs every:
history_blocks * (GRAPH_STORE_HISTORY_SLACK_FACTOR - 1)
blocks. With the example above: 10000 * (1.5 - 1) = 5000 blocks between reprunes.
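Both formulas can be verified with a quick calculation; a sketch using awk with the example values from above:

```shell
history_blocks=10000
slack_factor=1.5

# Reprune triggers once accumulated history exceeds
# history_blocks * slack_factor, and then repeats roughly every
# history_blocks * (slack_factor - 1) blocks.
awk -v h="$history_blocks" -v s="$slack_factor" 'BEGIN {
  printf "reprune threshold: %d blocks\n", h * s
  printf "interval between reprunes: %d blocks\n", h * (s - 1)
}'
```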

Configuration

Set the slack factor with the environment variable:
GRAPH_STORE_HISTORY_SLACK_FACTOR=1.5
Set this value high enough so repruning occurs relatively infrequently to avoid excessive database work. Values between 1.3 and 2.0 are typical.

Pruning Strategies

Graph Node uses two different strategies to remove unneeded data:

Delete Strategy

How it works: Deletes rows (old entity versions) from existing tables.
When used: When the estimated fraction of data to remove is between DELETE_THRESHOLD and REBUILD_THRESHOLD.
Advantages:
  • Does not block indexing
  • Simpler operation
  • Lower peak resource usage
Disadvantages:
  • Tables remain fragmented
  • Storage is not immediately reclaimed
  • May be slower for large deletions

Rebuild Strategy

How it works: Copies data that should be kept into new tables, then replaces the existing tables with these smaller tables.
When used: When the estimated fraction of data to remove exceeds REBUILD_THRESHOLD.
Advantages:
  • Much smaller final tables
  • Storage immediately reclaimed
  • Better performance for large deletions
  • Tables are defragmented
Disadvantages:
  • Temporarily blocks indexing while copying non-final entities
  • Higher peak resource usage
  • More complex operation

Strategy Selection Thresholds

Configure the thresholds with environment variables (values between 0 and 1):
GRAPH_STORE_HISTORY_REBUILD_THRESHOLD=0.7
GRAPH_STORE_HISTORY_DELETE_THRESHOLD=0.3
For each table:
  • If estimated removal > REBUILD_THRESHOLD: rebuild the table
  • If estimated removal between DELETE_THRESHOLD and REBUILD_THRESHOLD: delete old versions
  • If estimated removal < DELETE_THRESHOLD: skip (not worth processing)
Graph Node automatically analyzes tables to estimate removal fractions, using Postgres statistics for accuracy.
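The per-table decision above can be sketched as a small shell helper. This is a hypothetical illustration using the default thresholds expressed as percentages; the exact boundary handling inside Graph Node may differ:

```shell
# Hypothetical helper mirroring the per-table strategy choice.
# Thresholds are the documented defaults (0.7 and 0.3), as percentages.
choose_strategy() {
  removal_pct=$1   # estimated fraction of rows to remove, 0-100
  if [ "$removal_pct" -gt 70 ]; then
    echo rebuild
  elif [ "$removal_pct" -ge 30 ]; then
    echo delete
  else
    echo skip
  fi
}

choose_strategy 85   # rebuild
choose_strategy 45   # delete
choose_strategy 10   # skip
```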

Batching and Performance

Batch Target Duration

To avoid very long-running transactions, pruning operations are broken into batches:
GRAPH_STORE_BATCH_TARGET_DURATION=120  # seconds
Each batch aims to complete within this duration.

Parallel Operation

In most cases, pruning runs in parallel with indexing without blocking it. Exception: When using the rebuild strategy, indexing is blocked while copying non-final entities from the existing table to the new table. This is typically brief but depends on the number of non-final entities.

Monitoring Pruning Operations

Initial Prune

The graphman prune command prints a progress report to the console:
Start pruning historical entities from block 1000000 to 15000000
Analyzing tables...
Analyzed 24 tables
Rebuilding table user (estimated 85% removal)
Deleting from table token (estimated 45% removal)
Skipping table config (estimated 10% removal)
Finished pruning entities: deleted 1.2M rows, copied 300K rows in 45m 30s

Ongoing Repruning

For automatic reprune operations, watch Graph Node logs for:

1. Start message

Start pruning historical entities from block 15005000 to 20000000
Indicates pruning has begun and shows the block range.

2. Analysis message

Analyzed N tables
Reports how many tables were analyzed for pruning.

3. Completion message

Finished pruning entities: deleted 500K rows, copied 100K rows in 12m 15s
Shows total rows deleted/copied and duration.

Metrics

While Graph Node doesn’t expose dedicated pruning metrics, you can monitor:
# Deployment head growth (should continue during pruning)
rate(deployment_head[5m])

# Block processing time (may temporarily increase during rebuild)
rate(deployment_block_processing_duration_sum[5m]) / rate(deployment_block_processing_duration_count[5m])

Use Cases and Recommendations

Time-Series Data

Scenario: Subgraph tracks daily statistics (e.g., daily trading volume, daily active users).
Recommendation:
graphman --config config.toml prune <deployment> --history 30000
Why: Daily statistics are immutable once the day ends. Pruning old entity versions doesn’t affect time-series queries but dramatically reduces storage.

Event Logs

Scenario: Subgraph indexes event logs that are never updated.
Recommendation:
graphman --config config.toml prune <deployment> --history 50000
Why: Events don’t have historical versions, so pruning provides storage benefits with minimal impact on queries.

Lifetime Statistics

Scenario: Entities track lifetime statistics (e.g., total token transfers, all-time high price).
Consideration: Pruning limits how far back you can generate time-series from these statistics.
Recommendation:
graphman --config config.toml prune <deployment> --history 100000
Why: Set history_blocks based on the oldest historical query your application needs. For example, 100,000 blocks provides ~2 weeks of Ethereum mainnet history.
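To size history_blocks for a given time window, convert blocks to wall-clock time. A sketch assuming ~12-second Ethereum mainnet block times (other chains will differ):

```shell
history_blocks=100000
seconds_per_block=12   # approximate Ethereum mainnet block time

# 86400 seconds per day; integer division truncates toward zero
echo "$(( history_blocks * seconds_per_block / 86400 )) days of history"
```

With these values the window works out to roughly two weeks, matching the estimate above.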

High-Churn Entities

Scenario: Entities update frequently (e.g., token prices, user balances).
Recommendation:
graphman --config config.toml prune <deployment> --history 10000
Why: High-churn entities generate many versions quickly. Aggressive pruning (lower history_blocks) provides significant storage and performance benefits.

Archived Deployments

Scenario: Old deployment versions no longer receiving queries.
Recommendation:
graphman --config config.toml prune <deployment> --history 1000 --once
Why: Minimal history retention frees maximum storage. Use --once if you plan to remove the deployment entirely soon.

Best Practices

  • Begin with a higher value (e.g., 50,000 blocks) and reduce it based on your actual query patterns. Monitor query errors to find the right balance.
  • Always set history_blocks significantly higher than ETHEREUM_REORG_THRESHOLD to prevent conflicts. Add at least 1000 blocks as a safety margin.
  • The first prune can take considerable time for large deployments. Run it during low-traffic periods if possible.
  • Prune a copy of your deployment in a staging environment to estimate duration and verify query compatibility.
  • Record why you chose a specific value so future maintainers understand the tradeoff.
  • If you might need to graft from this deployment, ensure history_blocks is large enough to support your grafting point.
  • For deployments with many tables, adjust REBUILD_THRESHOLD to balance between performance gains and indexing interruptions.

Troubleshooting

Queries return “block not found” errors

Cause: Query requests a block before earliest_block.
Solution:
  1. Check the deployment’s earliest_block via the status API
  2. Modify queries to respect this limit
  3. Consider increasing history_blocks if this is too restrictive

Pruning takes too long

Causes and solutions:
Symptom: Rebuild operations taking hours.
Solution: This is expected for very large deployments. Consider:
  • Running initial prune during maintenance windows
  • Raising REBUILD_THRESHOLD so the delete strategy is used instead
  • Increasing GRAPH_STORE_BATCH_TARGET_DURATION for fewer, larger batches
Symptom: Reprune operations running too often.
Solution: Increase GRAPH_STORE_HISTORY_SLACK_FACTOR:
GRAPH_STORE_HISTORY_SLACK_FACTOR=2.0
This reduces reprune frequency but allows more history accumulation.
Symptom: Pruning analyzes tables repeatedly or chooses suboptimal strategies.
Solution: Manually update Postgres statistics:
ANALYZE sgd123.entity_table;

Indexing blocked during rebuild

Symptom: Deployment stops indexing during prune operations.
Expected behavior: This occurs when copying non-final entities during table rebuilds.
Solutions:
  • Reduce rebuild frequency: Increase GRAPH_STORE_HISTORY_SLACK_FACTOR
  • Use delete strategy: Raise REBUILD_THRESHOLD so deletions are favored over rebuilds
  • Smaller batches: Reduce GRAPH_STORE_BATCH_TARGET_DURATION for shorter blocking periods

Storage not reclaimed after pruning

Cause: Delete strategy marks space as reusable but doesn’t return it to the OS.
Solutions:

1. Run VACUUM

VACUUM sgd123.entity_table;
Marks deleted space as reusable within Postgres.

2. Run VACUUM FULL (requires downtime)

VACUUM FULL sgd123.entity_table;
Returns space to the OS but locks the table.

3. Force rebuild strategy

Lower REBUILD_THRESHOLD to trigger rebuilds more aggressively:
GRAPH_STORE_HISTORY_REBUILD_THRESHOLD=0.5

Cannot graft from pruned deployment

Cause: Target graft block is before the deployment’s earliest_block. Solution:
  • Increase history_blocks before pruning if grafting is planned
  • Use a different source deployment that hasn’t been pruned
  • Reindex the source deployment from scratch if necessary

Configuration Reference

Environment Variables

# How much history before triggering reprune (default: 1.5)
GRAPH_STORE_HISTORY_SLACK_FACTOR=1.5

# Rebuild if removing more than this fraction (default: 0.7)
GRAPH_STORE_HISTORY_REBUILD_THRESHOLD=0.7

# Delete if removing more than this fraction (default: 0.3)
GRAPH_STORE_HISTORY_DELETE_THRESHOLD=0.3

# Target duration for each batch in seconds (default: 120)
GRAPH_STORE_BATCH_TARGET_DURATION=120

Graphman Commands

# Initial prune with automatic repruning
graphman --config config.toml prune <deployment> --history <blocks>

# One-time prune without automatic repruning
graphman --config config.toml prune <deployment> --history <blocks> --once

# Disable automatic repruning
graphman --config config.toml prune <deployment> --history 999999999
