
Overview

The archiver service automatically moves old check results from your database to S3 storage as compressed Parquet files. This keeps your database lean while preserving historical data for long-term analysis.

How It Works

  1. Scheduled Job: Runs on a cron schedule (default: 3am daily)
  2. Batch Processing: Archives check results older than the retention period in configurable batches
  3. Parquet Export: Converts rows to columnar Parquet format for efficient storage and querying
  4. S3 Upload: Uploads partitioned files to S3 with day-based folder structure
  5. Database Cleanup: Removes successfully archived rows from the database
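The batch loop in step 2 can be sketched as follows. This is an illustrative sketch, not the service's actual code: `chunk` and the variable names are hypothetical stand-ins showing how eligible rows are split into `ARCHIVAL_BATCH_SIZE`-sized groups so each pass stays bounded.

```typescript
// Split eligible row IDs into fixed-size batches (step 2).
// Hypothetical helper; the archiver's real implementation may differ.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Example: 25,001 eligible rows with the default batch size of 10,000
// produce three batches of 10,000, 10,000, and 5,001 rows.
const ids = Array.from({ length: 25_001 }, (_, i) => `row-${i}`);
const batches = chunk(ids, 10_000);
```

Bounded batches keep memory use predictable regardless of how many rows have accumulated past the retention period.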

Partitioning Structure

Check results are organized by date for efficient querying:
```text
s3://your-bucket/pongo/archives/
├── year=2025/
│   ├── month=12/
│   │   ├── day=01/
│   │   │   └── pongo_check_results_[batch-id].parquet
│   │   ├── day=02/
│   │   │   └── pongo_check_results_[batch-id].parquet
│   │   └── day=03/
│   │       └── pongo_check_results_[batch-id].parquet
```
This structure enables:
  • Partition pruning: Query engines can skip irrelevant date ranges
  • Cost optimization: Only scan the data you need
  • Easy deletion: Remove entire date partitions when no longer needed
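The hive-style path segments shown above can be derived from a check result's timestamp roughly like this. A minimal sketch, assuming UTC and zero-padded months/days as in the layout shown; `partitionPath` is a hypothetical name, not the archiver's API.

```typescript
// Build the year=/month=/day= partition path for a check result's timestamp.
// Hypothetical sketch: assumes UTC dates and zero-padded segments.
function partitionPath(checkedAt: Date): string {
  const year = checkedAt.getUTCFullYear();
  const month = String(checkedAt.getUTCMonth() + 1).padStart(2, "0");
  const day = String(checkedAt.getUTCDate()).padStart(2, "0");
  return `year=${year}/month=${month}/day=${day}`;
}

const path = partitionPath(new Date("2025-12-01T14:30:00Z"));
// → "year=2025/month=12/day=01"
```

Keeping the key names (`year`, `month`, `day`) in the path is what lets engines like Athena and DuckDB prune partitions automatically.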

Archiver Service

The archiver runs as a standalone process:
```bash
bun archiver
```

Docker Integration

Enable archiving in Docker deployments with:
```dockerfile
ENV ARCHIVAL_ENABLED=true
```
The entrypoint script automatically starts the archiver alongside the scheduler when ARCHIVAL_ENABLED=true.

Configuration

Environment Variables

| Variable | Description | Default |
|---|---|---|
| `ARCHIVAL_ENABLED` | Enable archival service | `false` |
| `ARCHIVAL_RETENTION_DAYS` | Days to keep in database before archiving | `30` |
| `ARCHIVAL_CRON` | Cron schedule for archival job | `0 3 * * *` (3am daily) |
| `ARCHIVAL_BATCH_SIZE` | Rows to archive per batch | `10000` |
| `ARCHIVAL_LOCAL_PATH` | Temporary local storage for Parquet files | `./archives` |
| `ARCHIVER_PORT` | HTTP API port | `3002` |
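Reading these settings with their documented defaults might look like the sketch below. The interface and function names are illustrative, not the service's actual code; only the variable names and defaults come from the table above.

```typescript
// Parse archival settings from environment variables, falling back to the
// documented defaults. Hypothetical sketch; names mirror the config table.
interface ArchivalConfig {
  enabled: boolean;
  retentionDays: number;
  cron: string;
  batchSize: number;
  localPath: string;
  port: number;
}

function loadArchivalConfig(env: Record<string, string | undefined>): ArchivalConfig {
  return {
    enabled: env.ARCHIVAL_ENABLED === "true",
    retentionDays: Number(env.ARCHIVAL_RETENTION_DAYS ?? "30"),
    cron: env.ARCHIVAL_CRON ?? "0 3 * * *",
    batchSize: Number(env.ARCHIVAL_BATCH_SIZE ?? "10000"),
    localPath: env.ARCHIVAL_LOCAL_PATH ?? "./archives",
    port: Number(env.ARCHIVER_PORT ?? "3002"),
  };
}

// Only ARCHIVAL_ENABLED is set; everything else takes its default.
const cfg = loadArchivalConfig({ ARCHIVAL_ENABLED: "true" });
```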

S3 Configuration

All S3 environment variables except the optional S3_PREFIX are required for archival to work. Without S3 configuration, the archiver will fail.
| Variable | Description | Example |
|---|---|---|
| `S3_BUCKET` | S3 bucket name | `my-pongo-archives` |
| `S3_REGION` | AWS region | `us-east-1` |
| `S3_ACCESS_KEY_ID` | AWS access key ID | `AKIAIOSFODNN7EXAMPLE` |
| `S3_SECRET_ACCESS_KEY` | AWS secret access key | `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY` |
| `S3_PREFIX` | S3 key prefix (optional) | `pongo/archives` |

Example Configuration

```bash
# Enable archival
ARCHIVAL_ENABLED=true

# Archive data older than 30 days
ARCHIVAL_RETENTION_DAYS=30

# Run archival at 3am UTC daily
ARCHIVAL_CRON=0 3 * * *

# Process 10,000 rows per batch
ARCHIVAL_BATCH_SIZE=10000

# Local temporary storage
ARCHIVAL_LOCAL_PATH=./archives

# S3 configuration
S3_BUCKET=my-pongo-archives
S3_REGION=us-east-1
S3_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
S3_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
S3_PREFIX=pongo/archives
```

Archival Process

1. Select Eligible Rows

The archiver queries for check results where:
  • checkedAt < (now - ARCHIVAL_RETENTION_DAYS)
  • archivedAt IS NULL (not previously archived)

2. Mark as Archiving

Rows are immediately marked with archivedAt = now() to prevent duplicate archival in concurrent runs.
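Steps 1 and 2 can be combined so that selecting and claiming rows happens atomically, avoiding a race between concurrent runs. The cutoff computation and the SQL below are a sketch of this approach under assumed table and column names, not the service's exact query.

```typescript
// Compute the retention cutoff: rows with checkedAt before this are eligible.
function archivalCutoff(now: Date, retentionDays: number): Date {
  return new Date(now.getTime() - retentionDays * 24 * 60 * 60 * 1000);
}

// Illustrative SQL: claim up to one batch of eligible rows by setting
// archivedAt in the same statement that selects them, so a concurrent run
// cannot pick up the same rows. Table/column names are assumptions.
const claimSql = `
  UPDATE check_results
  SET "archivedAt" = now()
  WHERE id IN (
    SELECT id FROM check_results
    WHERE "checkedAt" < $1 AND "archivedAt" IS NULL
    LIMIT $2
  )
  RETURNING *;
`;

// With the default 30-day retention, a run at 3am on Dec 31 archives
// everything checked before 3am on Dec 1.
const cutoff = archivalCutoff(new Date("2025-12-31T03:00:00Z"), 30);
```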

3. Partition by Day

Rows are grouped by the date of their checkedAt timestamp:
```typescript
const partitionKey = formatPartitionKey(checkedAt); // "2025-12-01"
```
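A plausible sketch of `formatPartitionKey`, assuming day granularity in UTC; the real implementation may differ.

```typescript
// Derive the day-granularity partition key from a check's timestamp.
// Sketch only: assumes UTC, which matches the default 3am UTC schedule.
function formatPartitionKey(checkedAt: Date): string {
  return checkedAt.toISOString().slice(0, 10); // "YYYY-MM-DD"
}

const key = formatPartitionKey(new Date("2025-12-01T14:30:00Z"));
// → "2025-12-01"
```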

4. Write Parquet Files

Each partition is written to a local Parquet file:
```text
./archives/year=2025/month=12/day=01/pongo_check_results_[batch-id].parquet
```

5. Upload to S3

Parquet files are uploaded to S3 with the same partition structure:
```text
s3://bucket/pongo/archives/year=2025/month=12/day=01/pongo_check_results_[batch-id].parquet
```

6. Cleanup

  • Successful rows: Deleted from the database
  • Failed rows: archivedAt is reset to NULL for retry on the next run
  • Local files: Deleted after successful S3 upload
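The bookkeeping in step 6 amounts to partitioning row IDs by upload outcome. A hypothetical sketch, with assumed types and names:

```typescript
// Split archived rows into those safe to delete and those to reset for retry
// (archivedAt back to NULL), based on whether their Parquet file reached S3.
// Illustrative types; not the archiver's actual data model.
interface UploadResult {
  rowIds: string[];
  uploaded: boolean;
}

function planCleanup(results: UploadResult[]): { toDelete: string[]; toRetry: string[] } {
  const toDelete: string[] = [];
  const toRetry: string[] = [];
  for (const r of results) {
    (r.uploaded ? toDelete : toRetry).push(...r.rowIds);
  }
  return { toDelete, toRetry };
}

const plan = planCleanup([
  { rowIds: ["a", "b"], uploaded: true },
  { rowIds: ["c"], uploaded: false },
]);
```

Because failed rows get `archivedAt` reset to NULL, they satisfy the eligibility query again and are retried on the next scheduled run with no extra state.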

Parquet Schema

Archived check results include:
| Column | Type | Description |
|---|---|---|
| `id` | string | Unique check result ID |
| `monitorId` | string | Monitor identifier |
| `status` | string | `"up"`, `"down"`, or `"degraded"` |
| `responseTimeMs` | number | Response time in milliseconds |
| `statusCode` | number | HTTP status code (if applicable) |
| `message` | string | Additional context or error message |
| `checkedAt` | timestamp | When the check was performed |
| `createdAt` | timestamp | When the record was created |

Querying Archived Data

Use AWS Athena, DuckDB, or any Parquet-compatible tool:

AWS Athena Example

```sql
CREATE EXTERNAL TABLE pongo_check_results (
  id STRING,
  monitorId STRING,
  status STRING,
  responseTimeMs BIGINT,
  statusCode INT,
  message STRING,
  checkedAt TIMESTAMP,
  createdAt TIMESTAMP
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 's3://your-bucket/pongo/archives/';

MSCK REPAIR TABLE pongo_check_results;

SELECT monitorId, AVG(responseTimeMs) AS avg_response
FROM pongo_check_results
WHERE year = 2025 AND month = 12 AND day = 1
GROUP BY monitorId;
```

DuckDB Example

```sql
SELECT monitorId, AVG(responseTimeMs) AS avg_response
FROM read_parquet('s3://your-bucket/pongo/archives/year=2025/month=12/*/*.parquet')
WHERE status = 'up'
GROUP BY monitorId;
```

Monitoring Archival

The archiver exposes an HTTP API on port 3002 (configurable via ARCHIVER_PORT):
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/trigger` | POST | Manually trigger archival |

Manual Trigger

```bash
curl -X POST http://localhost:3002/trigger
```

Best Practices

Tune Retention Period

Balance database size with query performance needs. 30 days works for most use cases.

Monitor Batch Size

Larger batches reduce overhead but increase memory usage. Start with 10,000.

Schedule Off-Peak

Run archival during low-traffic periods to minimize database impact.

Set S3 Lifecycle Rules

Configure S3 lifecycle policies to automatically delete old archives or transition to Glacier.

Troubleshooting

Archival Not Running

  1. Verify ARCHIVAL_ENABLED=true
  2. Check all S3 environment variables are set
  3. Review archiver logs for errors

Large Database Despite Archival

  1. Check ARCHIVAL_RETENTION_DAYS is appropriate
  2. Verify archival cron schedule is running
  3. Manually trigger archival to test: curl -X POST http://localhost:3002/trigger

S3 Upload Failures

  1. Verify IAM credentials have s3:PutObject permission
  2. Check bucket name and region are correct
  3. Ensure bucket exists and is accessible

Related

  • Deployment: deploy the archiver service in production
  • Environment Variables: complete list of configuration options
  • Database: learn about database schema and management
