
Overview

Wake-on-request is Hatch’s core serverless feature: freeze idle VMs to zero compute cost, wake them transparently on the next request. This is achieved through three integrated components:
  1. Idle monitor — detects inactive VMs and snapshots them automatically
  2. HTTP proxy wake — restores snapshotted VMs when HTTP requests arrive
  3. SSH gateway wake — restores snapshotted VMs when SSH connections arrive
From the client’s perspective, snapshotted VMs appear slow but never “down”: HTTP requests see increased time-to-first-byte, and SSH clients see a slow handshake.

Idle Detection and Auto-Snapshot

The idle monitor runs as a background goroutine that periodically checks all VMs with proxy routes.

Monitoring Loop

// internal/proxy/idle.go:50-63
func (m *IdleMonitor) loop() {
  ticker := time.NewTicker(m.interval)  // Default: 30 seconds
  defer ticker.Stop()
  
  for {
    select {
    case <-m.stopCh:
      return
    case <-ticker.C:
      m.check()  // Scan all VMs
    }
  }
}

Idle Check Logic

1. Get all proxy routes

Only VMs with proxy routes are candidates for idle snapshot
// internal/proxy/idle.go:65-72
routes, err := m.db.ListAllRoutes()
if err != nil {
  slog.Error("idle monitor: list routes", "error", err)
  return
}
2. Check each VM state

Skip VMs that aren’t running
// internal/proxy/idle.go:76-80
vm, ok := m.vmm.Get(route.VMID)
if !ok || vm.State != store.VMStateRunning {
  continue
}
3. Calculate idle time

Compare current time against last proxy access timestamp
// internal/proxy/idle.go:82-91
lastAccess := m.proxy.LastAccessTime(route.Subdomain)
if lastAccess == 0 {
  // Never accessed through proxy; use VM's created_at as baseline
  lastAccess = vm.CreatedAt.Unix()
}

idleSeconds := now - lastAccess
if idleSeconds < int64(m.timeout.Seconds()) {
  continue  // Not idle yet
}
4. Check for active SSH sessions

Read /proc/net/nf_conntrack to detect established SSH connections
// internal/proxy/idle.go:93-97
if vm.SSHPort > 0 && hasActiveSSHSessions(vm.SSHPort) {
  slog.Debug("skipping idle snapshot, active SSH session",
    "vm", route.VMID, "ssh_port", vm.SSHPort)
  continue
}
This prevents snapshotting a VM while someone is actively using it over SSH.
5. Trigger snapshot

Call VM manager to snapshot the idle VM
// internal/proxy/idle.go:99-109
slog.Info("vm idle, triggering snapshot",
  "vm", route.VMID,
  "subdomain", route.Subdomain,
  "idle_seconds", idleSeconds,
)

ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
if _, err := m.vmm.Snapshot(ctx, route.VMID); err != nil {
  slog.Error("idle snapshot failed", "vm", route.VMID, "error", err)
}
cancel()

Active SSH Detection

The idle monitor checks the kernel’s connection tracking table to detect active SSH sessions:
// internal/proxy/idle.go:115-127
func hasActiveSSHSessions(sshPort int) bool {
  data, err := os.ReadFile("/proc/net/nf_conntrack")
  if err != nil {
    return false
  }
  
  needle := fmt.Sprintf("dport=%d", sshPort)
  for _, line := range strings.Split(string(data), "\n") {
    if strings.Contains(line, "ESTABLISHED") && strings.Contains(line, needle) {
      return true  // Active connection found
    }
  }
  return false
}
An example conntrack entry:
ipv4     2 tcp      6 431999 ESTABLISHED src=10.0.0.5 dst=192.168.1.10 sport=52314 dport=16000 ...
This line indicates an active SSH connection to host port 16000 (the VM’s forwarded SSH port).

Wake-on-HTTP

The proxy server handles wake-on-HTTP when a request arrives for a snapshotted VM.

Request Flow

1. Extract subdomain

Parse the Host header to get the subdomain (e.g., my-agent.hatch.local → my-agent)
// internal/proxy/proxy.go:52-57
subdomain := p.extractSubdomain(r.Host)
if subdomain == "" {
  http.Error(w, `{"error":"no subdomain in host header"}`, http.StatusBadGateway)
  return
}
2. Look up route

Find VM ID and target port from database
// internal/proxy/proxy.go:59-69
route, err := p.db.GetRouteBySubdomain(subdomain)
if err != nil {
  http.Error(w, `{"error":"internal error"}`, http.StatusBadGateway)
  return
}
if route == nil {
  http.Error(w, `{"error":"no route for subdomain"}`, http.StatusBadGateway)
  return
}
3. Record access time

Update last-access timestamp for idle detection
// internal/proxy/proxy.go:71-72
p.recordAccess(subdomain)
4. Check VM state

Determine if wake is needed
// internal/proxy/proxy.go:74-103
vm, ok := p.vmm.Get(route.VMID)
if !ok {
  http.Error(w, `{"error":"vm not found"}`, http.StatusBadGateway)
  return
}

switch vm.State {
case store.VMStateRunning:
  // Happy path: VM is already running
  
case store.VMStateSnapshotted:
  if !route.AutoWake {
    http.Error(w, `{"error":"vm is snapshotted and auto-wake is disabled"}`,
      http.StatusServiceUnavailable)
    return
  }
  
  // Wake the VM
  if err := p.wakeVM(r.Context(), route.VMID); err != nil {
    http.Error(w, `{"error":"failed to wake vm"}`, http.StatusServiceUnavailable)
    return
  }
  
  // Re-fetch VM after restore
  vm, ok = p.vmm.Get(route.VMID)
  
default:
  http.Error(w, `{"error":"vm is in state ..., not proxying"}`,
    http.StatusServiceUnavailable)
  return
}
5. Reverse proxy

Forward request to VM’s guest IP and target port
// internal/proxy/proxy.go:110-124
target := &url.URL{
  Scheme: "http",
  Host:   net.JoinHostPort(vm.GuestIP, fmt.Sprintf("%d", route.TargetPort)),
}

rp := &httputil.ReverseProxy{
  Director: func(req *http.Request) {
    req.URL.Scheme = target.Scheme
    req.URL.Host = target.Host
    req.Host = r.Host  // Preserve original Host header
  },
}

rp.ServeHTTP(w, r)

Wake VM Implementation

The wakeVM function serializes concurrent wake requests for the same VM:
// internal/proxy/proxy.go:128-160
func (p *Proxy) wakeVM(ctx context.Context, vmID string) error {
  // Get or create a per-VM mutex
  val, _ := p.wakeMu.LoadOrStore(vmID, &sync.Mutex{})
  mu := val.(*sync.Mutex)
  mu.Lock()
  defer mu.Unlock()
  
  // Re-check state under the lock: another request may have already restored it
  vm, ok := p.vmm.Get(vmID)
  if !ok {
    return fmt.Errorf("vm not found: %s", vmID)
  }
  if vm.State == store.VMStateRunning {
    return nil  // Already restored by a concurrent request
  }
  
  ctx, cancel := context.WithTimeout(ctx, p.wakeTimeout)  // Default: 2 minutes
  defer cancel()
  
  slog.Info("waking snapshotted vm", "vm", vmID)
  _, err := p.vmm.Restore(ctx, vmID)
  if err != nil {
    // Mark VM as error so subsequent queued requests don't retry
    // the same failing restore in a tight loop
    p.vmm.MarkError(vmID, err)
    return fmt.Errorf("restore vm %s: %w", vmID, err)
  }
  
  // Allow a brief moment for the restored VM's network to come up
  time.Sleep(500 * time.Millisecond)
  return nil
}
The per-VM mutex ensures that if 10 requests arrive simultaneously for a snapshotted VM, only one restore operation runs. The other 9 requests wait on the mutex and return immediately when they acquire it (VM is already running).

Wake-on-SSH

The SSH gateway listens on all active SSH ports and can wake VMs before forwarding connections.

Reconciliation Loop

The gateway periodically syncs its listener set with the database:
// internal/proxy/ssh_gateway.go:76-130
func (g *SSHGateway) reconcile() {
  vms, err := g.db.ListVMs()
  if err != nil {
    return
  }
  
  // Snapshot the current listener map (a field on SSHGateway, keyed by port)
  current := g.listeners
  
  // Build desired listener set
  want := make(map[int]struct{})
  for i := range vms {
    if p := vms[i].SSHPort; p > 0 {
      want[p] = struct{}{}
    }
  }
  
  // Add missing listeners
  for p := range want {
    if _, ok := current[p]; ok {
      continue  // Already listening
    }
    
    ln, err := net.Listen("tcp", net.JoinHostPort("0.0.0.0", strconv.Itoa(p)))
    if err != nil {
      slog.Warn("ssh gateway: listen failed", "port", p, "error", err)
      continue
    }
    
    g.listeners[p] = ln
    slog.Info("ssh gateway: listening", "port", p)
    go g.servePort(p, ln)
  }
  
  // Remove stale listeners
  for p, ln := range current {
    if _, ok := want[p]; ok {
      continue
    }
    ln.Close()
    delete(g.listeners, p)
  }
}

Connection Handling

1. Accept connection

// internal/proxy/ssh_gateway.go:132-147
func (g *SSHGateway) servePort(port int, ln net.Listener) {
  for {
    conn, err := ln.Accept()
    if err != nil {
      return  // Listener closed
    }
    go g.handleConn(port, conn)
  }
}
2. Look up VM by SSH port

// internal/proxy/ssh_gateway.go:149-156
func (g *SSHGateway) handleConn(port int, conn net.Conn) {
  defer conn.Close()
  
  vm, err := g.db.GetVMBySSHPort(port)
  if err != nil || vm == nil {
    return
  }
3. Wake if snapshotted

// internal/proxy/ssh_gateway.go:158-174
switch vm.State {
case store.VMStateRunning:
  // Ready to proxy
  
case store.VMStateSnapshotted:
  if err := g.wakeVM(context.Background(), vm.ID); err != nil {
    slog.Error("ssh gateway: wake vm failed", "vm", vm.ID, "error", err)
    return
  }
  
  vm, err = g.db.GetVM(vm.ID)
  if err != nil || vm == nil || vm.State != store.VMStateRunning {
    return
  }
  
default:
  return  // VM in error/stopped state
}
4. Dial guest SSH daemon

// internal/proxy/ssh_gateway.go:176-185
upstream, err := net.DialTimeout("tcp",
  net.JoinHostPort(vm.GuestIP, "22"), 10*time.Second)
if err != nil {
  slog.Debug("ssh gateway: dial guest failed", "error", err)
  return
}
defer upstream.Close()
5. Bidirectional TCP pipe

// internal/proxy/ssh_gateway.go:187
pipeBidirectional(conn, upstream)
This creates two goroutines to copy data in both directions:
// internal/proxy/ssh_gateway.go:230-251
func pipeBidirectional(a, b net.Conn) {
  var wg sync.WaitGroup
  wg.Add(2)
  
  go func() {
    defer wg.Done()
    io.Copy(a, b)
    if tc, ok := a.(*net.TCPConn); ok {
      tc.CloseWrite()
    }
  }()
  
  go func() {
    defer wg.Done()
    io.Copy(b, a)
    if tc, ok := b.(*net.TCPConn); ok {
      tc.CloseWrite()
    }
  }()
  
  wg.Wait()
}

Concurrent Wake Request Handling

Both the proxy and SSH gateway use per-VM mutexes to serialize wake operations:
// Shared pattern in both proxy.go and ssh_gateway.go
type Proxy struct {
  wakeMu sync.Map  // map[vmID string]*sync.Mutex
  // ...
}

func (p *Proxy) wakeVM(ctx context.Context, vmID string) error {
  // Get or create mutex for this VM
  val, _ := p.wakeMu.LoadOrStore(vmID, &sync.Mutex{})
  mu := val.(*sync.Mutex)
  
  mu.Lock()
  defer mu.Unlock()
  
  // Double-check: another goroutine may have already restored the VM
  if vm, ok := p.vmm.Get(vmID); ok && vm.State == store.VMStateRunning {
    return nil  // Already awake
  }
  
  // Perform restore (only one goroutine gets here)
  _, err := p.vmm.Restore(ctx, vmID)
  return err
}

Example Timeline

1. T+0ms: 5 requests arrive

All for the same snapshotted VM vm-abc123
2. T+1ms: First request acquires mutex

  • Request 1: Acquires mutex, checks state (snapshotted), starts restore
  • Requests 2-5: Block on mutex
3. T+5000ms: Restore completes

  • Request 1: Restore finishes, VM state → running, releases mutex
  • Request 2: Acquires mutex, checks state (running), skips restore, releases mutex
  • Request 3: Acquires mutex, checks state (running), skips restore, releases mutex
4. T+5005ms: All requests proceed

All 5 requests now proxy to the running VM
This pattern prevents “thundering herd” restore attempts and ensures exactly-once semantics for wake operations.

Configuration

Wake-on-request behavior is controlled by environment variables:
Variable                    Default   Description
HATCH_IDLE_CHECK_INTERVAL   30s       How often the idle monitor checks VMs
HATCH_IDLE_TIMEOUT          5m        Idle duration before auto-snapshot
HATCH_WAKE_TIMEOUT          2m        Max time to wait for restore to complete

Per-Route Auto-Wake Flag

Each proxy route has an auto_wake boolean:
POST /vms/{id}/routes
{
  "subdomain": "my-agent",
  "target_port": 3000,
  "auto_wake": true    # Enable wake-on-HTTP for this route
}
If auto_wake: false, the proxy returns 503 Service Unavailable when the VM is snapshotted instead of waking it.

Client Experience

HTTP Request to Snapshotted VM

$ time curl https://my-agent.hatch.local
{"status": "ok"}

real    0m6.234s    # ~6 seconds (restore time + request time)
user    0m0.008s
sys     0m0.004s
Subsequent requests:
$ time curl https://my-agent.hatch.local
{"status": "ok"}

real    0m0.234s    # Normal latency
user    0m0.008s
sys     0m0.004s

SSH Connection to Snapshotted VM

$ time ssh -p 16000 hatch@host
# Hangs for ~5 seconds while VM wakes
Welcome to Ubuntu 22.04.3 LTS
...

hatch@vm-123:~$ 
The SSH client sees this as a slow handshake but the connection succeeds normally.

Performance Tuning

Reduce Idle Timeout

Lower HATCH_IDLE_TIMEOUT to snapshot VMs more aggressively:
HATCH_IDLE_TIMEOUT=1m  # Snapshot after 1 minute idle
Trade-off: More frequent snapshots = more S3 operations

Increase Check Interval

Raise HATCH_IDLE_CHECK_INTERVAL to reduce CPU overhead:
HATCH_IDLE_CHECK_INTERVAL=2m  # Check every 2 minutes
Trade-off: Longer delay before idle VMs are snapshotted

Optimize S3 Latency

Use MinIO on the same host for development:
HATCH_S3_ENDPOINT=http://localhost:9000
Or choose an S3 region close to your Hatch host

Pre-Warm VMs

For latency-sensitive workloads, keep VMs running:
# Disable auto-wake for critical routes
{"auto_wake": false}

# Or set very long idle timeout
HATCH_IDLE_TIMEOUT=24h

Monitoring

Logs include structured events for wake operations:
{"level":"info","msg":"vm idle, triggering snapshot","vm":"vm-abc123","subdomain":"my-agent","idle_seconds":320}
{"level":"info","msg":"snapshot created","vm":"vm-abc123","snapshot":"snap-xyz789"}
{"level":"info","msg":"waking snapshotted vm","vm":"vm-abc123"}
{"level":"info","msg":"vm restored","vm":"vm-abc123"}
Watch these logs to understand idle/wake patterns and tune timeouts.

Troubleshooting

VM is never snapshotted
  • Check the VM has a proxy route: GET /vms/{id}/routes
  • Verify the idle monitor is running: look for “idle monitor started” in logs
  • Check for active SSH sessions: grep ESTABLISHED /proc/net/nf_conntrack
  • Ensure S3 is configured: idle snapshot requires S3 storage

Wake or restore fails
  • Check S3 connectivity: aws s3 ls s3://{bucket}/snapshots/{vm_id}/
  • Verify the latest snapshot exists in the database: GET /vms/{id}/snapshots
  • Check disk space on the host: restore needs space for memory + disk delta
  • Review VM logs: the snapshot may have corrupted state

Multiple restores run for the same VM
  • This indicates a bug in the per-VM mutex logic
  • Check logs for “waking snapshotted vm”; it should appear only once per wake
  • Verify VM state is correctly updated to running after restore

SSH wake does not work
  • Verify the SSH gateway is running: “ssh wake gateway started” in logs
  • Check the gateway is listening on the VM’s SSH port: netstat -tlnp | grep {ssh_port}
  • Ensure the VM has ssh_port set in the database
  • Test restore manually: POST /vms/{id}/restore
