System initialization
The monitoring service starts with the `guard` command, which initializes all components in the correct order.
Startup sequence
- Database connection: Establish PostgreSQL/TimescaleDB connection pool
- Redis connection: Connect to Redis for URL caching
- Orchestrator creation: Initialize the orchestrator with connections
- Event bus setup: Create event bus and register listeners
- Supervisor activation: Start the supervisor’s batching goroutine
- Worker creation: Create parent workers for each monitoring interval
- Redis prefill: Load URLs from database into Redis for fast access
- Start monitoring: Begin ticking intervals to trigger checks
Listeners are registered before workers start, ensuring no events are missed during initialization.
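The ordering above can be sketched as a simple sequence. The step names below are illustrative stand-ins, not the project's actual initializer functions:

```go
package main

import "fmt"

// guard records startup phases in order; the step names are illustrative
// stand-ins for the real component initializers.
type guard struct{ log []string }

func (g *guard) step(name string) { g.log = append(g.log, name) }

// start mirrors the documented order: listeners are registered (step 4)
// before workers are created (step 6), so no events are missed.
func (g *guard) start() {
	g.step("db: open PostgreSQL/TimescaleDB pool")
	g.step("redis: connect for URL caching")
	g.step("orchestrator: initialize with connections")
	g.step("event bus: create and register listeners")
	g.step("supervisor: start batching goroutine")
	g.step("workers: create parent worker per interval")
	g.step("redis: prefill URL lists from database")
	g.step("monitor: start interval tickers")
}

func main() {
	g := &guard{}
	g.start()
	for i, s := range g.log {
		fmt.Printf("%d. %s\n", i+1, s)
	}
}
```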
Monitoring cycle
Each monitoring interval runs independently with its own ticker and worker pool.
Tick signal flow
- Ticker fires: Go’s `time.Ticker` fires at the configured interval (e.g., every 300 seconds for `five_minutes`)
- Signal sent: Orchestrator sends `true` to the parent worker’s signal channel
- URL fetch: Parent worker fetches all URL IDs from the Redis list `urls_interval_{seconds}`
- Work chunking: URLs are divided into chunks based on `MAXIMUM_WORK_POOL_SIZE`
- Distribution: Chunks are pushed to the work pool channel for child workers
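The work-chunking step can be shown in isolation. The `chunkIDs` helper below is a sketch, not the project's actual code; the `size` parameter plays the role of `MAXIMUM_WORK_POOL_SIZE`:

```go
package main

import "fmt"

// chunkIDs splits a slice of URL IDs into chunks of at most size,
// mirroring how the parent worker divides Redis list entries before
// pushing them onto the work pool channel. (Illustrative helper.)
func chunkIDs(ids []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var chunks [][]string
	for len(ids) > 0 {
		n := size
		if len(ids) < n {
			n = len(ids)
		}
		chunks = append(chunks, ids[:n])
		ids = ids[n:]
	}
	return chunks
}

func main() {
	ids := []string{"u1", "u2", "u3", "u4", "u5"}
	for _, c := range chunkIDs(ids, 2) {
		fmt.Println(c) // [u1 u2], then [u3 u4], then [u5]
	}
}
```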
HTTP check execution
Child workers continuously process URL chunks from the work pool.
Worker processing flow
Evaluate result
Worker determines health based on response:
- Network error: Marks as unhealthy
- Status 200-299: Marks as healthy
- Other status codes: Marks as unhealthy
Workers operate concurrently, allowing multiple URLs to be checked simultaneously. The number of concurrent checks is controlled by `MAXIMUM_CHILD_WORKERS`.
Supervisor decision logic
The supervisor receives tasks from all workers and applies batching for efficiency.
Batching mechanism
The supervisor uses a two-trigger batching system:
Batch size trigger
When the buffer reaches `SUPERVISOR_POOL_FLUSH_BATCHSIZE` tasks (default: 100), the supervisor immediately flushes. This ensures high-throughput processing during peak monitoring periods.
Timeout trigger
A ticker fires every `SUPERVISOR_POOL_FLUSH_TIMEOUT` seconds (default: 5) to flush incomplete batches. This prevents tasks from sitting in the buffer indefinitely during low-traffic periods.
Event publishing
When flushing, the supervisor publishes domain events to the event bus.
Event handling
The event bus dispatches events to registered listeners asynchronously.
Event dispatch flow
Each listener runs in its own goroutine, allowing parallel processing of side effects without blocking the supervisor.
Successful ping handling
When a URL check succeeds, the `PingSuccessfulListener` handles the event.
Recovery workflow
The listener follows this sequence:
- Fetch URL metadata: Retrieve current status from the `urls` table
- Detect recovery: Check if `url.Status == enums.UnHealthy`
- Resolve incident: Mark the incident as resolved in the `incidents` table
- Send notification: Email the contact address with recovery details
- Persist metrics: Insert status record into the `url_statuses` hypertable
- Update status: Set URL status to `healthy` in the `urls` table
Recovery notifications are only sent when transitioning from unhealthy to healthy, preventing spam on already-healthy URLs.
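The recovery sequence can be sketched with hypothetical `Store` and `Mailer` interfaces standing in for the project's repositories and mailer; the names and shapes below are assumptions for illustration:

```go
package main

import "fmt"

type Status int

const (
	Healthy Status = iota
	UnHealthy
)

// Store and Mailer are illustrative stand-ins for the real repositories.
type Store interface {
	URLStatus(id string) (Status, error)
	ResolveIncident(id string) error
	InsertStatus(id string, healthy bool) error
	SetStatus(id string, s Status) error
}

type Mailer interface{ SendRecovery(id string) error }

// handlePingSuccessful follows the documented order: resolve + notify only
// on an unhealthy->healthy transition, then persist metrics and status.
func handlePingSuccessful(id string, db Store, mail Mailer) error {
	st, err := db.URLStatus(id)
	if err != nil {
		return err
	}
	if st == UnHealthy { // recovery detected
		if err := db.ResolveIncident(id); err != nil {
			return err
		}
		if err := mail.SendRecovery(id); err != nil {
			fmt.Println("mail error:", err) // logged, doesn't block persistence
		}
	}
	if err := db.InsertStatus(id, true); err != nil {
		return err
	}
	return db.SetStatus(id, Healthy)
}

// In-memory fakes to demonstrate the call order.
type memStore struct {
	status Status
	log    []string
}

func (m *memStore) URLStatus(string) (Status, error) { return m.status, nil }
func (m *memStore) ResolveIncident(string) error     { m.log = append(m.log, "resolve"); return nil }
func (m *memStore) InsertStatus(string, bool) error  { m.log = append(m.log, "metric"); return nil }
func (m *memStore) SetStatus(_ string, s Status) error {
	m.status = s
	m.log = append(m.log, "status")
	return nil
}

type memMailer struct{ sent int }

func (m *memMailer) SendRecovery(string) error { m.sent++; return nil }

func main() {
	db := &memStore{status: UnHealthy}
	mail := &memMailer{}
	_ = handlePingSuccessful("url-1", db, mail)
	fmt.Println(db.log, mail.sent) // [resolve metric status] 1
}
```

Note that a second successful ping against an already-healthy URL skips the resolve/notify branch entirely, which is what prevents notification spam.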
Unsuccessful ping handling
When a URL check fails, the `PingUnSuccessfulListener` handles the event.
Failure workflow
The listener follows this sequence:
- Fetch URL metadata: Retrieve current status from the `urls` table
- Detect failure: Check if `url.Status == enums.Healthy`
- Create incident: Log a new incident in the `incidents` table
- Send notification: Email the contact address with downtime details
- Update status: Set URL status to `unhealthy` in the `urls` table
- Persist metrics: Insert status record into the `url_statuses` hypertable
Data model
Watchdog uses three main database tables to store monitoring data.
URLs table (metadata)
Schema: `migrations/20251115112101_create_urls_table.sql`
`database/url_model.go:9-18`
URL statuses table (time-series)
Schema: `migrations/20251208081023_create_url_status_table.sql`
- `time`: When the check was performed
- `url_id`: Which URL was checked
- `status`: `true` for healthy (2xx), `false` for unhealthy
`database/url_status_model.go:8-12`
Incidents table
Tracks downtime incidents and their resolution:
- Created when a URL transitions from healthy to unhealthy
- Resolved when the URL recovers
- Used for incident reporting and uptime calculations
Complete flow diagram
Here’s the complete flow from tick to notification.
Performance considerations
Concurrency limits
The system provides several configuration options to control concurrency:
- `MAXIMUM_CHILD_WORKERS`: Controls how many workers process URLs concurrently per interval
- `MAXIMUM_WORK_POOL_SIZE`: Limits chunk size to prevent channel overflow
- `SUPERVISOR_POOL_FLUSH_BATCHSIZE`: Batches events for efficient processing
Redis caching
URLs are cached in Redis to minimize database queries:
- Lists: Store URL IDs grouped by interval for quick enumeration
- Hashes: Store complete URL objects for worker access
- Refresh: Redis is repopulated when URLs are added or removed
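The shape of the prefill can be sketched with plain maps standing in for the two Redis structures (the real code talks to Redis; only the `urls_interval_{seconds}` key pattern comes from the text above, the rest is illustrative):

```go
package main

import "fmt"

type URL struct {
	ID       string
	Address  string
	Interval int // seconds
}

// prefill groups URL IDs into per-interval "lists" and stores full
// objects in a "hash" keyed by ID, mirroring the two Redis structures.
func prefill(urls []URL) (lists map[string][]string, hash map[string]URL) {
	lists = map[string][]string{}
	hash = map[string]URL{}
	for _, u := range urls {
		key := fmt.Sprintf("urls_interval_%d", u.Interval)
		lists[key] = append(lists[key], u.ID)
		hash[u.ID] = u
	}
	return lists, hash
}

func main() {
	urls := []URL{
		{ID: "a", Address: "https://example.com", Interval: 300},
		{ID: "b", Address: "https://example.org", Interval: 300},
		{ID: "c", Address: "https://example.net", Interval: 60},
	}
	lists, _ := prefill(urls)
	fmt.Println(lists["urls_interval_300"]) // [a b]
	fmt.Println(lists["urls_interval_60"])  // [c]
}
```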
Error handling
The system handles errors at multiple levels:
- Network errors: Treated as unhealthy checks, reported to supervisor
- Redis errors: Logged and operation skipped for that cycle
- Database errors: Logged with structured logging, event handling continues
- Email errors: Logged but don’t block metric persistence
Next steps
Architecture overview
Review the high-level system design
Component details
Explore individual component implementations