Pruning is an operation that deletes old entity versions from a deployment, reducing storage requirements and improving both query and indexing performance.

Overview

By default, Graph Node stores a complete version history for all entities, allowing time-travel queries to any historical block. While powerful, this creates significant storage overhead and can degrade performance. Pruning removes entity versions older than a specified block, making it impossible to query the deployment at blocks prior to the pruning point. However, queries at blocks after the pruning point are unaffected and perform better due to reduced data volume.

Benefits

  • Reduced storage: Dropping historical entity versions can shrink a deployment's data substantially, especially for subgraphs with frequently updated entities
  • Faster queries: Less data to scan improves query performance considerably
  • Better indexing speed: Smaller tables and indexes speed up block processing
  • Lower infrastructure costs: Decreased storage and compute requirements

Tradeoffs

Pruning is a user-visible operation with important limitations:
  • Limited time-travel queries: Can only query back to the earliest_block after pruning
  • Restricted grafting: Cannot graft further back than history_blocks before the current head
  • Irreversible: Pruned data cannot be recovered without reindexing from scratch

How Pruning Works

History Retention

Pruning is controlled by the history_blocks parameter, which specifies how many blocks of history to retain. The actual pruning block b is calculated as:
b = latest_block - history_blocks
After pruning, the earliest_block for the deployment is updated, and Graph Node returns an error for any query attempting to time-travel before this block.
The value of history_blocks must be greater than ETHEREUM_REORG_THRESHOLD to ensure that chain reorganizations never conflict with pruning operations.
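As a quick sanity check, the pruning block can be computed with shell arithmetic. The block numbers below are illustrative, not taken from a real deployment:

```shell
# Illustrative values; substitute your deployment's actual head block
# and your chosen history_blocks setting.
latest_block=20000000
history_blocks=10000

# Entity versions older than the pruning block b are removed
prune_block=$((latest_block - history_blocks))
echo "earliest queryable block after pruning: $prune_block"
```

With these values, time-travel queries below block 19990000 would fail after pruning.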

Accessing Earliest Block

You can retrieve the earliest_block through the Index Node status API:
curl http://localhost:8030/graphql \
  -H 'Content-Type: application/json' \
  -d '{"query": "{ indexingStatusForCurrentVersion(subgraphName: \"author/subgraph\") { chains { earliestBlock { number } } } }"}'
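To use the earliest block programmatically, the number can be extracted from the JSON response. A minimal sketch using only POSIX sed, with a canned sample response standing in for a live API call (the exact field layout of the real response may differ, so a JSON-aware tool like jq is more robust):

```shell
# Canned sample shaped like the status API response; in practice,
# pipe the output of the curl command above into the sed filter.
response='{"data":{"indexingStatusForCurrentVersion":{"chains":[{"earliestBlock":{"number":"1000000"}}]}}}'

# Pull out the earliestBlock number; this pattern assumes the
# sample shape above with a single "number" field.
echo "$response" | sed -n 's/.*"number":"\([0-9]*\)".*/\1/p'
```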

Using Graphman Prune

Initial Pruning

Start pruning a deployment with:
graphman --config config.toml prune <deployment> --history <blocks>
Arguments:
  • <deployment> - Deployment identifier (name, IPFS hash, or namespace)
  • --history <blocks> - Number of blocks of history to retain
Example:
graphman --config config.toml prune author/subgraph --history 10000
This command:
  1. Performs an initial prune of the deployment
  2. Sets the deployment’s history_blocks configuration
  3. Enables automatic repruning when more history accumulates

One-Time Pruning

To prune once without enabling automatic repruning:
graphman --config config.toml prune <deployment> --history <blocks> --once
This performs a single prune operation and does not set up ongoing maintenance.

Disabling Automatic Pruning

To stop automatic repruning after initial setup:
graphman --config config.toml prune <deployment> --history 999999999
Setting history_blocks to a very large value effectively disables repruning.

Automatic Repruning

How It Works

After initial pruning, Graph Node automatically reprunes the deployment when it accumulates more than:
history_blocks * GRAPH_STORE_HISTORY_SLACK_FACTOR
blocks of history. Example calculation:
  • GRAPH_STORE_HISTORY_SLACK_FACTOR=1.5
  • history_blocks=10000
  • Reprune triggers when history exceeds: 10000 * 1.5 = 15000 blocks

Reprune Frequency

Repruning occurs every:
history_blocks * (GRAPH_STORE_HISTORY_SLACK_FACTOR - 1)
blocks. With the example above: 10000 * (1.5 - 1) = 5000 blocks between reprunes.
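Both formulas can be verified with a quick calculation; a sketch using awk with the example values from above:

```shell
history_blocks=10000
slack_factor=1.5

# Reprune triggers once accumulated history exceeds
# history_blocks * slack_factor, and then repeats roughly every
# history_blocks * (slack_factor - 1) blocks.
awk -v h="$history_blocks" -v s="$slack_factor" 'BEGIN {
  printf "reprune threshold: %d blocks\n", h * s
  printf "interval between reprunes: %d blocks\n", h * (s - 1)
}'
```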

Configuration

Set the slack factor with the environment variable:
GRAPH_STORE_HISTORY_SLACK_FACTOR=1.5
Set this value high enough so repruning occurs relatively infrequently to avoid excessive database work. Values between 1.3 and 2.0 are typical.

Pruning Strategies

Graph Node uses two different strategies to remove unneeded data:

Delete Strategy

How it works: Deletes rows (old entity versions) from existing tables.
When used: When the estimated fraction of data to remove is between DELETE_THRESHOLD and REBUILD_THRESHOLD.
Advantages:
  • Does not block indexing
  • Simpler operation
  • Lower peak resource usage
Disadvantages:
  • Tables remain fragmented
  • Storage is not immediately reclaimed
  • May be slower for large deletions

Rebuild Strategy

How it works: Copies data that should be kept into new tables, then replaces the existing tables with these smaller tables.
When used: When the estimated fraction of data to remove exceeds REBUILD_THRESHOLD.
Advantages:
  • Much smaller final tables
  • Storage immediately reclaimed
  • Better performance for large deletions
  • Tables are defragmented
Disadvantages:
  • Temporarily blocks indexing while copying non-final entities
  • Higher peak resource usage
  • More complex operation

Strategy Selection Thresholds

Configure the thresholds with environment variables (values between 0 and 1):
GRAPH_STORE_HISTORY_REBUILD_THRESHOLD=0.7
GRAPH_STORE_HISTORY_DELETE_THRESHOLD=0.3
For each table:
  • If estimated removal > REBUILD_THRESHOLD: rebuild the table
  • If estimated removal between DELETE_THRESHOLD and REBUILD_THRESHOLD: delete old versions
  • If estimated removal < DELETE_THRESHOLD: skip (not worth processing)
Graph Node automatically analyzes tables to estimate removal fractions, using Postgres statistics for accuracy.
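The per-table decision above can be sketched as a small shell helper. This is a hypothetical illustration using the default thresholds expressed as percentages; the exact boundary handling inside Graph Node may differ:

```shell
# Hypothetical helper mirroring the per-table strategy choice.
# Thresholds are the documented defaults (0.7 and 0.3), as percentages.
choose_strategy() {
  removal_pct=$1   # estimated fraction of rows to remove, 0-100
  if [ "$removal_pct" -gt 70 ]; then
    echo rebuild
  elif [ "$removal_pct" -ge 30 ]; then
    echo delete
  else
    echo skip
  fi
}

choose_strategy 85   # rebuild
choose_strategy 45   # delete
choose_strategy 10   # skip
```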

Batching and Performance

Batch Target Duration

To avoid very long-running transactions, pruning operations are broken into batches:
GRAPH_STORE_BATCH_TARGET_DURATION=120  # seconds
Each batch aims to complete within this duration.

Parallel Operation

In most cases, pruning runs in parallel with indexing without blocking it. Exception: When using the rebuild strategy, indexing is blocked while copying non-final entities from the existing table to the new table. This is typically brief but depends on the number of non-final entities.

Monitoring Pruning Operations

Initial Prune

The graphman prune command prints a progress report to the console:
Start pruning historical entities from block 1000000 to 15000000
Analyzing tables...
Analyzed 24 tables
Rebuilding table user (estimated 85% removal)
Deleting from table token (estimated 45% removal)
Skipping table config (estimated 10% removal)
Finished pruning entities: deleted 1.2M rows, copied 300K rows in 45m 30s

Ongoing Repruning

For automatic reprune operations, watch Graph Node logs for:

1. Start message

Start pruning historical entities from block 15005000 to 20000000
Indicates pruning has begun and shows the block range.

2. Analysis message

Analyzed N tables
Reports how many tables were analyzed for pruning.

3. Completion message

Finished pruning entities: deleted 500K rows, copied 100K rows in 12m 15s
Shows total rows deleted/copied and duration.

Metrics

While Graph Node doesn’t expose dedicated pruning metrics, you can monitor:
# Deployment head growth (should continue during pruning)
rate(deployment_head[5m])

# Block processing time (may temporarily increase during rebuild)
rate(deployment_block_processing_duration_sum[5m]) / rate(deployment_block_processing_duration_count[5m])

Use Cases and Recommendations

Time-Series Data

Scenario: Subgraph tracks daily statistics (e.g., daily trading volume, daily active users).
Recommendation:
graphman --config config.toml prune <deployment> --history 30000
Why: Daily statistics are immutable once the day ends. Pruning old entity versions doesn’t affect time-series queries but dramatically reduces storage.

Event Logs

Scenario: Subgraph indexes event logs that are never updated.
Recommendation:
graphman --config config.toml prune <deployment> --history 50000
Why: Events don’t have historical versions, so pruning provides storage benefits with minimal impact on queries.

Lifetime Statistics

Scenario: Entities track lifetime statistics (e.g., total token transfers, all-time high price).
Consideration: Pruning limits how far back you can generate time-series from these statistics.
Recommendation:
graphman --config config.toml prune <deployment> --history 100000
Why: Set history_blocks based on the oldest historical query your application needs. For example, 100,000 blocks provides ~2 weeks of Ethereum mainnet history.
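To size history_blocks for a given time window, convert blocks to wall-clock time. A sketch assuming ~12-second Ethereum mainnet block times (other chains will differ):

```shell
history_blocks=100000
seconds_per_block=12   # approximate Ethereum mainnet block time

# 86400 seconds per day; integer division truncates toward zero
echo "$(( history_blocks * seconds_per_block / 86400 )) days of history"
```

With these values the window works out to roughly two weeks, matching the estimate above.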

High-Churn Entities

Scenario: Entities update frequently (e.g., token prices, user balances).
Recommendation:
graphman --config config.toml prune <deployment> --history 10000
Why: High-churn entities generate many versions quickly. Aggressive pruning (lower history_blocks) provides significant storage and performance benefits.

Archived Deployments

Scenario: Old deployment versions no longer receiving queries.
Recommendation:
graphman --config config.toml prune <deployment> --history 1000 --once
Why: Minimal history retention frees maximum storage. Use --once if you plan to remove the deployment entirely soon.

Best Practices

  • Begin with a higher value (e.g., 50,000 blocks) and reduce it based on your actual query patterns. Monitor query errors to find the right balance.
  • Always set history_blocks significantly higher than ETHEREUM_REORG_THRESHOLD to prevent conflicts. Add at least 1000 blocks as a safety margin.
  • The first prune can take considerable time for large deployments. Run it during low-traffic periods if possible.
  • Prune a copy of your deployment in a staging environment to estimate duration and verify query compatibility.
  • Record why you chose a specific value so future maintainers understand the tradeoff.
  • If you might need to graft from this deployment, ensure history_blocks is large enough to support your grafting point.
  • For deployments with many tables, adjust REBUILD_THRESHOLD to balance between performance gains and indexing interruptions.

Troubleshooting

Queries return “block not found” errors

Cause: Query requests a block before earliest_block.
Solution:
  1. Check the deployment’s earliest_block via the status API
  2. Modify queries to respect this limit
  3. Consider increasing history_blocks if this is too restrictive

Pruning takes too long

Causes and solutions:
Symptom: Rebuild operations taking hours.
Solution: This is expected for very large deployments. Consider:
  • Running initial prune during maintenance windows
  • Raising REBUILD_THRESHOLD so the delete strategy is used instead
  • Increasing GRAPH_STORE_BATCH_TARGET_DURATION for fewer, larger batches
Symptom: Reprune operations running too often.
Solution: Increase GRAPH_STORE_HISTORY_SLACK_FACTOR:
GRAPH_STORE_HISTORY_SLACK_FACTOR=2.0
This reduces reprune frequency but allows more history accumulation.
Symptom: Pruning analyzes tables repeatedly or chooses suboptimal strategies.
Solution: Manually update Postgres statistics:
ANALYZE sgd123.entity_table;

Indexing blocked during rebuild

Symptom: Deployment stops indexing during prune operations.
Expected behavior: This occurs when copying non-final entities during table rebuilds.
Solutions:
  • Reduce rebuild frequency: Increase GRAPH_STORE_HISTORY_SLACK_FACTOR
  • Use delete strategy: Raise REBUILD_THRESHOLD so deletions are favored over rebuilds
  • Smaller batches: Reduce GRAPH_STORE_BATCH_TARGET_DURATION for shorter blocking periods

Storage not reclaimed after pruning

Cause: Delete strategy marks space as reusable but doesn’t return it to the OS.
Solutions:

1. Run VACUUM

VACUUM sgd123.entity_table;
Marks deleted space as reusable within Postgres.

2. Run VACUUM FULL (requires downtime)

VACUUM FULL sgd123.entity_table;
Returns space to the OS but locks the table.

3. Force rebuild strategy

Lower REBUILD_THRESHOLD to trigger rebuilds more aggressively:
GRAPH_STORE_HISTORY_REBUILD_THRESHOLD=0.5

Cannot graft from pruned deployment

Cause: Target graft block is before the deployment’s earliest_block. Solution:
  • Increase history_blocks before pruning if grafting is planned
  • Use a different source deployment that hasn’t been pruned
  • Reindex the source deployment from scratch if necessary

Configuration Reference

Environment Variables

# How much history before triggering reprune (default: 1.5)
GRAPH_STORE_HISTORY_SLACK_FACTOR=1.5

# Rebuild if removing more than this fraction (default: 0.7)
GRAPH_STORE_HISTORY_REBUILD_THRESHOLD=0.7

# Delete if removing more than this fraction (default: 0.3)
GRAPH_STORE_HISTORY_DELETE_THRESHOLD=0.3

# Target duration for each batch in seconds (default: 120)
GRAPH_STORE_BATCH_TARGET_DURATION=120

Graphman Commands

# Initial prune with automatic repruning
graphman --config config.toml prune <deployment> --history <blocks>

# One-time prune without automatic repruning
graphman --config config.toml prune <deployment> --history <blocks> --once

# Disable automatic repruning
graphman --config config.toml prune <deployment> --history 999999999
