Architecture Diagram
Overview
BuildBuddy implements the Remote Execution API, allowing Bazel to offload build action execution to a pool of remote workers. This provides massive parallelism, consistent build environments, and efficient resource utilization.
Components Involved
Bazel Client
The build tool requesting remote execution:
- Analyzes build graph locally
- Uploads inputs to CAS
- Sends execution requests
- Downloads outputs from cache
- Monitors execution progress
Execution Service (Scheduler)
Orchestrates remote execution:
- Receives Execute requests
- Validates action requirements
- Matches actions to executors
- Manages execution queue
- Tracks action lifecycle
- Returns execution results
Executor Pool
Worker machines that run actions:
- Registers with scheduler
- Declares capabilities (platform properties)
- Receives action assignments
- Executes commands in isolation
- Uploads outputs to cache
- Reports execution status
Redis Queue
Manages task distribution:
- Queues pending actions
- Schedules by priority
- Ensures fair distribution
- Handles executor failures
Content Addressable Storage (CAS)
Stores action inputs and outputs:
- Input files downloaded by executors
- Output files uploaded after execution
- Digest-based addressing
- Shared across all executors
Action Cache
Stores execution results:
- Checks for cached results before execution
- Stores results after execution
- Enables build action reuse
Execution Flow
Step 1: Action Preparation
- Local Analysis:
  - Bazel analyzes build graph
  - Identifies actions to execute
  - Determines action inputs and commands
- Input Upload:
  - Bazel computes input file digests
  - Checks which inputs are missing from CAS
  - Uploads missing inputs to BuildBuddy
  - Creates input root directory structure
- Action Digest Computation:
  - Computes action hash from:
    - Command line and arguments
    - Input file digests
    - Environment variables
    - Platform properties
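The digest computation above can be sketched as follows. This is only an illustration, not BuildBuddy's implementation: real Remote Execution API clients hash serialized protobuf messages, whereas this sketch hashes a canonical JSON encoding. The helper name `action_digest`, the payload layout, and the example property values are assumptions.

```python
import hashlib
import json


def action_digest(argv, input_digests, env, platform):
    """Return a SHA-256 hex digest over everything that defines an action.

    Hypothetical helper: real clients hash serialized protobufs, not JSON.
    """
    payload = {
        "argv": list(argv),             # argument order is significant
        "inputs": dict(input_digests),  # path -> content digest
        "env": dict(env),
        "platform": dict(platform),
    }
    # sort_keys plus fixed separators give one canonical byte string,
    # so dictionary ordering never changes the digest.
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


source_digest = hashlib.sha256(b"int main() { return 0; }").hexdigest()
base = action_digest(
    ["gcc", "-c", "foo.c", "-o", "foo.o"],
    {"foo.c": source_digest},
    {"PATH": "/usr/bin"},
    {"OSFamily": "linux", "Arch": "amd64"},  # illustrative property values
)
```

Because every ingredient feeds the hash, changing any one of them (a flag, an input file, an environment variable, a platform property) yields a different digest and therefore a different cache key.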
Step 2: Cache Check
- GetActionResult Request:
  - Bazel checks Action Cache first
  - Sends action digest to BuildBuddy
- Cache Hit Path:
  - If cached result exists:
    - Return ActionResult immediately
    - Bazel skips execution
    - Downloads outputs from CAS
    - Continues to next action
- Cache Miss Path:
  - If no cached result:
    - Proceed to remote execution
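The hit and miss paths can be sketched as a single lookup-then-fallback function. A plain dictionary stands in for the Action Cache here, and `run_action` and `execute_remotely` are hypothetical names chosen for illustration, not part of any real client.

```python
def run_action(digest, action_cache, execute_remotely):
    """Return (result, was_cached) for an action digest.

    action_cache: mapping of action digest -> result (stand-in for the Action Cache).
    execute_remotely: callback invoked only on a cache miss.
    """
    cached = action_cache.get(digest)
    if cached is not None:
        return cached, True                 # hit: skip execution entirely
    return execute_remotely(digest), False  # miss: fall through to execution


calls = []


def fake_execute(digest):
    # Records which digests actually reached "remote execution".
    calls.append(digest)
    return {"exit_code": 0, "outputs": []}


cache = {"abc": {"exit_code": 0, "outputs": ["foo.o"]}}
hit = run_action("abc", cache, fake_execute)
miss = run_action("def", cache, fake_execute)
```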
Step 3: Execute Request
- Bazel Sends Execute RPC:
  - The request identifies the action by the digest computed in Step 1
- Scheduler Receives Request:
  - Validates action format
  - Authenticates request
  - Extracts platform requirements
  - Assigns unique task ID
Step 4: Task Scheduling
- Queue Action:
  - Add to Redis queue
  - Priority based on:
    - User priority settings
    - Action size/complexity
    - Queue time (fairness)
- Executor Matching:
  - Find executor with matching platform
  - Check executor capacity
  - Consider executor health/performance
  - Assign task to executor
- Task Assignment:
  - Notify executor of new task
  - Executor claims task
  - Update task status to RUNNING
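A rough sketch of the queue-and-match logic is shown below. It uses an in-memory heap in place of the Redis queue, and the executor record layout (`name`, `properties`, `free_slots`) and the function names are assumptions made for the example; the real scheduler also weighs executor health and performance, which this sketch omits.

```python
import heapq
import itertools

_sequence = itertools.count()  # tie-breaker: preserves arrival order within a priority


def enqueue(queue, priority, task_id, platform):
    # Lower number means higher priority.
    heapq.heappush(queue, (priority, next(_sequence), task_id, platform))


def matches(executor, platform):
    """An executor matches if it advertises every requested platform property."""
    return all(executor["properties"].get(k) == v for k, v in platform.items())


def assign_next(queue, executors):
    """Assign the highest-priority task to a matching executor with free capacity.

    Returns (task_id, executor_name), or None if the queue is empty or
    no executor can take the task at the head of the queue.
    """
    if not queue:
        return None
    _priority, _seq, task_id, platform = queue[0]
    for executor in executors:
        if executor["free_slots"] > 0 and matches(executor, platform):
            heapq.heappop(queue)
            executor["free_slots"] -= 1
            return task_id, executor["name"]
    return None  # task stays queued until capacity appears


queue = []
enqueue(queue, 5, "low-priority-task", {"OSFamily": "linux"})
enqueue(queue, 1, "high-priority-task", {"OSFamily": "linux"})
executors = [
    {"name": "mac-1", "properties": {"OSFamily": "darwin"}, "free_slots": 1},
    {"name": "linux-1", "properties": {"OSFamily": "linux"}, "free_slots": 1},
]
first = assign_next(queue, executors)
second = assign_next(queue, executors)
```

Note that this sketch only ever looks at the head of the queue, so a task that no executor can run blocks everything behind it; a production scheduler would need to avoid that.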
Step 5: Action Execution
- Input Preparation:
  - Executor downloads input root from CAS
  - Reconstructs directory structure
  - Downloads all input files
  - Verifies input digests
- Environment Setup:
  - Create isolated execution environment:
    - Docker container, or
    - Podman container, or
    - Firecracker VM, or
    - Bare metal with sandbox
  - Set environment variables
  - Configure working directory
- Command Execution:
  - Run the command (e.g., compiler, linker)
  - Capture stdout and stderr
  - Monitor resource usage
  - Enforce timeout
  - Record exit code
- Output Collection:
  - Identify output files
  - Compute output digests
  - Prepare ActionResult
- Output Upload:
  - Upload output files to CAS
  - Upload stdout/stderr if requested
  - Ensure all outputs are uploaded before completing
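The command-execution step (capture output, enforce a timeout, record the exit code) can be sketched with the standard library. This runs the command directly on the host with no container or VM isolation, and the returned dictionary is a simplified, assumed shape rather than the real ActionResult message.

```python
import hashlib
import subprocess
import sys


def run_command(argv, timeout_seconds, env=None, cwd=None):
    """Run argv and return a dict loosely modeled on an ActionResult."""
    try:
        proc = subprocess.run(
            argv,
            capture_output=True,      # collect stdout and stderr as bytes
            timeout=timeout_seconds,
            env=env,
            cwd=cwd,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return {"status": "DEADLINE_EXCEEDED", "exit_code": None}
    return {
        "status": "OK",
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        # Digests are what would be stored in the CAS for the captured streams.
        "stdout_digest": hashlib.sha256(proc.stdout).hexdigest(),
        "stderr_digest": hashlib.sha256(proc.stderr).hexdigest(),
    }


ok = run_command([sys.executable, "-c", "print('built')"], timeout_seconds=30)
failed = run_command([sys.executable, "-c", "import sys; sys.exit(3)"], timeout_seconds=30)
slow = run_command([sys.executable, "-c", "import time; time.sleep(10)"], timeout_seconds=0.5)
```

A non-zero exit code is still reported with status `OK`: the command ran to completion, and it is up to the client to treat the result as a build failure.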
Step 6: Result Reporting
- Update Action Cache:
  - Store action digest → ActionResult mapping
  - Unless do_not_cache=true
  - Future executions of the same action hit the cache
- Send ExecuteResponse:
  - The response carries the ActionResult back to Bazel
- Bazel Receives Result:
  - Checks exit code
  - Downloads output files from CAS
  - Continues build with outputs
  - Or reports action failure
Step 7: Output Download
- Bazel receives output file digests
- Downloads outputs from CAS
- Places files in local build directory
- Proceeds to dependent actions
Executor Management
Executor Registration
- Executor Startup:
  - Executor process starts on worker machine
  - Connects to BuildBuddy scheduler
  - Registers capabilities:
    - Platform properties (OS, arch, etc.)
    - Resource capacity (CPU, memory, disk)
    - Container/VM support
- Health Monitoring:
  - Periodic heartbeats to scheduler
  - Reports current load and availability
  - Updates capability changes
- Deregistration:
  - Graceful shutdown drains tasks
  - Notifies scheduler of unavailability
  - Scheduler reassigns pending tasks
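The registration, heartbeat, and drain lifecycle can be sketched as a small in-memory registry. The class name `ExecutorRegistry`, its methods, and the record fields are hypothetical; the clock is injectable so that heartbeat expiry can be exercised without waiting.

```python
import time


class ExecutorRegistry:
    """In-memory stand-in for the scheduler's view of its executors."""

    def __init__(self, heartbeat_timeout, clock=time.monotonic):
        self.heartbeat_timeout = heartbeat_timeout
        self.clock = clock
        self.executors = {}

    def register(self, executor_id, properties, capacity):
        self.executors[executor_id] = {
            "properties": properties,   # e.g. OS, architecture
            "capacity": capacity,       # e.g. CPU, memory, disk
            "last_seen": self.clock(),
            "draining": False,
        }

    def heartbeat(self, executor_id):
        self.executors[executor_id]["last_seen"] = self.clock()

    def drain(self, executor_id):
        # Graceful shutdown: stop handing out new tasks, keep the record.
        self.executors[executor_id]["draining"] = True

    def deregister(self, executor_id):
        self.executors.pop(executor_id, None)

    def schedulable(self):
        """Executors that are not draining and have sent a recent heartbeat."""
        now = self.clock()
        return sorted(
            executor_id
            for executor_id, info in self.executors.items()
            if not info["draining"]
            and now - info["last_seen"] <= self.heartbeat_timeout
        )
```

An executor that stops sending heartbeats simply ages out of `schedulable()`, which is the same signal the scheduler would use to treat it as unhealthy.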
Platform Properties
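Executors advertise their capabilities as platform properties, which the scheduler matches against the properties each action requests.
Isolation Mechanisms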
Docker Containers:
- Each action runs in a fresh container
- Specified by container-image property
- Provides filesystem isolation
- Manages resource limits
Podman Containers:
- Rootless container execution
- Better security isolation
- Compatible with Docker images
Firecracker VMs:
- Lightweight microVMs
- Stronger isolation than containers
- Fast startup (sub-second)
- Used for untrusted code
Bare Metal:
- Sandboxing without containers
- Faster for trusted code
- Limited isolation
Performance Optimizations
Input Deduplication
- Content addressing eliminates duplicate uploads
- Common inputs (toolchains, SDKs) uploaded once
- Executors cache frequently used inputs locally
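The effect of content addressing on uploads can be sketched as follows. A dictionary stands in for the CAS, the membership check stands in for a FindMissingBlobs-style query, and `upload_missing` is a hypothetical helper name.

```python
import hashlib


def digest_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def upload_missing(files, cas):
    """Upload only blobs the CAS does not already hold.

    files: mapping of path -> file contents (bytes).
    cas: mapping of digest -> contents (stand-in for the remote CAS).
    Returns the number of bytes actually uploaded.
    """
    # Keying by digest collapses files with identical content into one blob.
    blobs = {digest_of(data): data for data in files.values()}
    # Stand-in for asking the server which digests it is missing.
    missing = [digest for digest in blobs if digest not in cas]
    uploaded = 0
    for digest in missing:
        cas[digest] = blobs[digest]
        uploaded += len(blobs[digest])
    return uploaded


cas = {}
workspace = {
    "a/config.h": b"#define N 1\n",
    "b/config.h": b"#define N 1\n",   # same content, different path
    "main.c": b"int main() { return 0; }\n",
}
first_upload = upload_missing(workspace, cas)
second_upload = upload_missing(workspace, cas)
```

The two identical headers are stored once, and the second call uploads nothing because every digest is already present.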
Persistent Workers
For JVM-based tools (Java, Kotlin, Scala):
- Keep compiler process running between actions
- Avoid JVM startup overhead
- Warm JIT compilation
- Significant speedup for incremental builds
Local Execution Cache
Executor maintains local cache:
- Input files cached on disk
- Container images cached
- Avoids repeated CAS downloads
- LRU eviction when disk fills
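LRU eviction bounded by total size can be sketched with an ordered dictionary. This tracks only digests and sizes, not the files themselves, and the class name `LruFileCache` is an assumption for the example.

```python
from collections import OrderedDict


class LruFileCache:
    """Tracks cached blobs by digest and evicts least recently used entries."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()  # digest -> size, least recently used first

    def get(self, digest):
        """Return True on a hit and mark the entry as most recently used."""
        if digest not in self.entries:
            return False
        self.entries.move_to_end(digest)
        return True

    def put(self, digest, size):
        if digest in self.entries:
            self.entries.move_to_end(digest)
            return
        if size > self.max_bytes:
            return  # larger than the whole cache: do not store
        self.entries[digest] = size
        self.used_bytes += size
        # Evict from the least recently used end until the cache fits again.
        while self.used_bytes > self.max_bytes:
            _evicted, evicted_size = self.entries.popitem(last=False)
            self.used_bytes -= evicted_size


cache = LruFileCache(max_bytes=100)
cache.put("toolchain", 40)
cache.put("sdk", 40)
cache.get("toolchain")      # touch: "sdk" is now the oldest entry
cache.put("headers", 40)    # exceeds 100 bytes, so "sdk" is evicted
```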
Action Prioritization
- Critical path actions prioritized
- Large actions scheduled early
- Fair queuing prevents starvation
- Priority can be set per user/org
Speculative Execution
For slow actions:
- Execute same action on multiple executors
- Use result from first to complete
- Cancel redundant executions
- Reduces tail latency
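The first-result-wins pattern can be sketched with threads. Executors are simulated by functions that wait for a fixed duration, and cancellation is cooperative through a shared event; `speculate` and `make_executor` are hypothetical names, and a real system would cancel over the network instead.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def make_executor(name, seconds):
    """Simulated executor that takes `seconds` to finish unless cancelled."""
    def run(action, cancel):
        # Event.wait returns True if cancellation arrives before the work is done.
        if cancel.wait(timeout=seconds):
            return None
        return f"{action} finished on {name}"
    return run


def speculate(action, executors):
    """Run the action on every executor and return the first result to complete."""
    cancel = threading.Event()
    with ThreadPoolExecutor(max_workers=len(executors)) as pool:
        futures = [pool.submit(executor, action, cancel) for executor in executors]
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        cancel.set()  # tell the remaining executions to stop
        return next(iter(done)).result()


winner = speculate(
    "compile foo.cc",
    [make_executor("slow-executor", 2.0), make_executor("fast-executor", 0.05)],
)
```

The trade-off is extra executor capacity spent on work that is thrown away, which is why this is usually reserved for actions already known to be slow.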
Failure Handling
Executor Failures
- Executor Crash:
  - Heartbeat timeout detected
  - Scheduler marks executor unhealthy
  - Reschedules in-progress actions
- Network Partition:
  - Executor isolated from scheduler
  - Actions eventually time out
  - Executor re-registers on reconnect
Action Failures
- Command Failure (non-zero exit code):
  - Result returned with exit code
  - Bazel handles as normal build failure
  - Logs available for debugging
- Timeout:
  - Action exceeds timeout
  - Executor kills process
  - Returns DEADLINE_EXCEEDED error
- Resource Exhaustion:
  - Executor runs out of memory or disk space
  - Executor fails action
  - May retry on different executor
Retries
- Transient errors (network, executor failure) retried automatically
- Configurable retry limits
- Exponential backoff
- Non-transient errors (command failure) not retried
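The retry policy can be sketched as a loop with exponential backoff. The exception class `TransientError`, the function name, and the default limits are assumptions for illustration; the sleep function is injectable so the delays can be inspected without waiting.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (network, executor loss)."""


def execute_with_retries(run, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call run() until it succeeds, retrying only on TransientError.

    Delays double each attempt: base_delay, 2 * base_delay, 4 * base_delay, ...
    Any other exception, and any returned result (including a non-zero
    exit code), is passed straight back to the caller without a retry.
    """
    for attempt in range(max_attempts):
        try:
            return run()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted
            sleep(base_delay * 2 ** attempt)


delays = []
outcomes = iter([TransientError("executor lost"), TransientError("network"), {"exit_code": 1}])


def flaky_run():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = execute_with_retries(flaky_run, sleep=delays.append)
```

Here the command ultimately fails with exit code 1, and that result is returned as is: a command failure is a valid outcome, not a transient error.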
Monitoring and Metrics
Execution Metrics
- Actions queued, running, completed
- Queue time (time waiting for executor)
- Execution time (time running on executor)
- Upload/download time and bytes
- Cache hit rate (action cache)
- Executor utilization
Performance Metrics
- End-to-end execution latency (p50, p95, p99)
- Input download time
- Output upload time
- Scheduler overhead
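As an illustration of how latency percentiles such as p50, p95, and p99 might be derived from raw samples, the sketch below uses the nearest-rank method. The sample values are made up, and a monitoring system would typically compute these from histograms instead of raw lists.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("percentile of empty sample set")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up end-to-end execution latencies in milliseconds.
latencies_ms = [120, 80, 95, 400, 110, 105, 90, 2500, 100, 115]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

With these samples the median is 105 ms while p95 and p99 are both 2500 ms, which shows how a single slow action dominates the tail even when typical latency is low.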
Reliability Metrics
- Action failure rate (by type)
- Executor failure rate
- Retry rate
- Timeout rate