Overview
S2 Lite provides built-in monitoring endpoints for health checks and metrics collection, making it easy to integrate with your observability stack.
Health Checks
The /health endpoint provides a simple way to check if S2 Lite is running and ready to accept requests.
Endpoint Details
- URL: /health
- Method: GET
- Success Response: HTTP 200 OK
- Use Cases: Readiness probes, liveness probes, load balancer health checks
Example Usage
```shell
# -f makes curl exit with a non-zero status on HTTP errors,
# which is what scripts and probes need
curl -f http://localhost:8080/health
```
Kubernetes Probes
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: s2-lite
spec:
  containers:
    - name: s2-lite
      image: ghcr.io/s2-streamstore/s2:latest
      ports:
        - containerPort: 80
      livenessProbe:
        httpGet:
          path: /health
          port: 80
        initialDelaySeconds: 10
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /health
          port: 80
        initialDelaySeconds: 5
        periodSeconds: 5
```
Prometheus Metrics
S2 Lite exposes internal metrics in Prometheus text format at the /metrics endpoint. These are operational metrics (latencies, batch sizes) intended for monitoring the running process, not business metrics such as storage or throughput.
Available Metrics
S2 Lite tracks the following operational metrics:
Append Latency Metrics
s2_append_permit_latency_seconds
- Type: Histogram
- Description: Time taken to acquire permission to append
- Buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s
s2_append_ack_latency_seconds
- Type: Histogram
- Description: End-to-end append acknowledgment latency
- Buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s
Batch Size Metrics
s2_append_batch_records
- Type: Histogram
- Description: Number of records per append batch
- Buckets: 1, 10, 50, 100, 250, 500, 1000
s2_append_batch_bytes
- Type: Histogram
- Description: Size in bytes of append batches
- Buckets: 512B, 1KB, 4KB, 16KB, 64KB, 256KB, 512KB, 1MB
Scraping Metrics
```shell
curl http://localhost:8080/metrics
```
Example output:
```
# HELP s2_append_ack_latency_seconds Append ack latency in seconds
# TYPE s2_append_ack_latency_seconds histogram
s2_append_ack_latency_seconds_bucket{le="0.005"} 145
s2_append_ack_latency_seconds_bucket{le="0.01"} 289
s2_append_ack_latency_seconds_bucket{le="0.025"} 312
s2_append_ack_latency_seconds_bucket{le="0.05"} 315
s2_append_ack_latency_seconds_bucket{le="+Inf"} 320
s2_append_ack_latency_seconds_sum 2.456
s2_append_ack_latency_seconds_count 320
# HELP s2_append_batch_bytes Append batch size in bytes
# TYPE s2_append_batch_bytes histogram
s2_append_batch_bytes_bucket{le="512"} 45
s2_append_batch_bytes_bucket{le="1024"} 120
s2_append_batch_bytes_bucket{le="4096"} 280
...
```
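Beyond the bucket series, the `_sum` and `_count` series in a scrape are enough to derive a mean latency directly. A minimal sketch, using only the two relevant lines from the example output above (a real consumer would use a Prometheus client library rather than hand-parsing):

```rust
// Derive the mean append ack latency from the _sum and _count series of a
// Prometheus text-format scrape. Parsing is deliberately minimal.
fn mean_from_text(metrics: &str) -> f64 {
    let mut sum = 0.0_f64;
    let mut count = 0.0_f64;
    for line in metrics.lines() {
        if let Some(v) = line.strip_prefix("s2_append_ack_latency_seconds_sum ") {
            sum = v.trim().parse().unwrap_or(0.0);
        } else if let Some(v) = line.strip_prefix("s2_append_ack_latency_seconds_count ") {
            count = v.trim().parse().unwrap_or(0.0);
        }
    }
    sum / count
}

fn main() {
    // The two relevant lines from the example scrape above.
    let scrape = "s2_append_ack_latency_seconds_sum 2.456\n\
                  s2_append_ack_latency_seconds_count 320";
    // 2.456s total across 320 appends ≈ 7.7ms mean
    println!("mean ack latency: {:.4}s", mean_from_text(scrape));
}
```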
Prometheus Configuration
Prometheus Scrape Config
Add S2 Lite as a scrape target in prometheus.yml:
```yaml
scrape_configs:
  - job_name: 's2-lite'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8080']
        labels:
          service: 's2-lite'
          environment: 'production'
```
Kubernetes ServiceMonitor
If using Prometheus Operator:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: s2-lite
  labels:
    app: s2-lite
spec:
  selector:
    matchLabels:
      app: s2-lite
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```
Helm Chart Configuration
When using the S2 Lite Helm chart:
```yaml
metrics:
  serviceMonitor:
    enabled: true
    interval: 30s
    labels:
      prometheus: kube-prometheus
```
Grafana Dashboards
Sample Queries
Append Latency Quantiles (P50, P95, P99)
```promql
# P50
histogram_quantile(0.50, rate(s2_append_ack_latency_seconds_bucket[5m]))

# P95
histogram_quantile(0.95, rate(s2_append_ack_latency_seconds_bucket[5m]))

# P99
histogram_quantile(0.99, rate(s2_append_ack_latency_seconds_bucket[5m]))
```
Append Rate
```promql
rate(s2_append_ack_latency_seconds_count[5m])
```
Average Batch Size (Records)
```promql
rate(s2_append_batch_records_sum[5m]) / rate(s2_append_batch_records_count[5m])
```
Average Batch Size (Bytes)
```promql
rate(s2_append_batch_bytes_sum[5m]) / rate(s2_append_batch_bytes_count[5m])
```
Example Grafana Panel
```json
{
  "title": "Append Latency (P95)",
  "targets": [
    {
      "expr": "histogram_quantile(0.95, rate(s2_append_ack_latency_seconds_bucket[5m]))",
      "legendFormat": "P95 Latency"
    }
  ],
  "yaxes": [
    {
      "format": "s",
      "label": "Latency"
    }
  ]
}
```
API Metrics (Cloud Only)
The /metrics API endpoint for basin and stream metrics is not supported in S2 Lite. These metrics are only available on the S2 cloud service.
For programmatic access to business metrics (storage, throughput, operations), use the S2 cloud service:
```rust
use s2_sdk::types::{
    AccountMetricSet, BasinMetricSet, StreamMetricSet,
    TimeRange, TimeRangeAndInterval, TimeseriesInterval,
};

// `time_range_and_interval` and `time_range` below are local helpers that
// construct TimeRangeAndInterval / TimeRange values for the query window.

// Account-level metrics
let metrics = client.get_account_metrics(
    GetAccountMetricsInput::new(
        AccountMetricSet::AccountOps(time_range_and_interval(24, None))
    )
).await?;

// Basin-level metrics
let metrics = client.get_basin_metrics(
    GetBasinMetricsInput::new(
        basin_name,
        BasinMetricSet::AppendThroughput(time_range_and_interval(1, None))
    )
).await?;

// Stream-level metrics
let metrics = client.get_stream_metrics(
    GetStreamMetricsInput::new(
        basin_name,
        stream_name,
        StreamMetricSet::Storage(time_range(1))
    )
).await?;
```
Alerting
Sample Prometheus Alerts
```yaml
groups:
  - name: s2-lite
    interval: 30s
    rules:
      - alert: HighAppendLatency
        expr: histogram_quantile(0.95, rate(s2_append_ack_latency_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High append latency detected"
          description: "P95 append latency is {{ $value }}s (threshold: 1s)"
      - alert: S2LiteDown
        expr: up{job="s2-lite"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "S2 Lite instance is down"
          description: "S2 Lite instance {{ $labels.instance }} has been down for 2 minutes"
```
Logging
S2 Lite outputs structured logs to stdout. Configure log level using the RUST_LOG environment variable:
```shell
# Info level (default)
RUST_LOG=info s2 lite

# Debug level for troubleshooting
RUST_LOG=debug s2 lite

# Specific module logging
RUST_LOG=s2_lite=debug,slatedb=info s2 lite
```
In production, set RUST_LOG=info or RUST_LOG=warn to reduce log volume.
SlateDB Configuration
S2 Lite uses SlateDB as its storage engine. Configure SlateDB settings using SL8_ prefixed environment variables:
```shell
# Flush interval (defaults to 50ms for remote, 5ms in-memory)
export SL8_FLUSH_INTERVAL=10ms

# Other SlateDB settings:
# https://docs.rs/slatedb/latest/slatedb/config/struct.Settings.html
```
Lower flush intervals improve write latency but may increase object storage API calls.
Monitoring Best Practices
- Set up health checks - Use /health for liveness and readiness probes
- Monitor append latency - Track P95 and P99 latency to detect performance degradation
- Alert on downtime - Configure alerts when S2 Lite becomes unavailable
- Track batch sizes - Understand your workload patterns
- Use structured logging - Enable JSON logging for better log aggregation
- Monitor resource usage - Track CPU, memory, and network metrics at the infrastructure level