Overview
Herald Daemon
Claims new computations from Kingdom
Mill Job Scheduler
Schedules Kubernetes Jobs for computation stages
Computations Cleaner
Removes old computation data (CronJob)
Herald Daemon
Image: duchy/herald
Deployment Name: {duchy-name}-herald-daemon
Type: Continuous deployment
Purpose
The Herald is the duchy’s agent for discovering and claiming work from the Kingdom. It continuously monitors the Kingdom’s System API for new computations assigned to this duchy and initializes them in the local duchy database.
Implementation
Implemented in src/main/kotlin/org/wfanet/measurement/duchy/herald/:
File: Herald.kt
Responsibilities
Work Discovery
The Herald polls the Kingdom System API to discover new computations:
- Streams active computations from Kingdom
- Filters for computations where this duchy is a participant
- Identifies computations not yet known locally
- Detects state changes in existing computations
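The discovery pass above can be sketched as a filter over the streamed computations (an illustrative Python sketch; the actual Herald is Kotlin and talks to the gRPC System API, and all field names here are hypothetical stand-ins):

```python
def find_new_work(streamed, duchy_id, known_ids):
    """Return computations this duchy participates in that are not yet known locally.

    `streamed` is an iterable of dicts standing in for streamed Computation
    messages; the `participants` and `id` fields are illustrative, not the
    real API's field names.
    """
    claimable = []
    for comp in streamed:
        if duchy_id not in comp["participants"]:
            continue  # not assigned to this duchy
        if comp["id"] in known_ids:
            continue  # already initialized locally
        claimable.append(comp)
    return claimable
```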
Work Claiming
Claims computations by:
- Creating computation records in local Spanner database
- Confirming participation with Kingdom
- Initializing computation tokens for work locking
- Setting initial computation stage
- Storing protocol-specific configuration
State Synchronization
Keeps duchy state in sync with Kingdom:
- Detects when Kingdom advances computation state
- Updates local computation records accordingly
- Handles computation cancellation from Kingdom
- Marks computations as completed when Kingdom indicates success
Computation Lifecycle Management
Manages the full lifecycle:
- WAIT_TO_START: Waiting for all participants to confirm
- READY: Ready to begin computation
- RUNNING: Actively computing
- SUCCEEDED/FAILED/CANCELLED: Terminal states
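The lifecycle above can be modeled as a small state machine. This sketch uses exactly the states listed; the assumption that FAILED and CANCELLED are reachable from any non-terminal state is an illustrative guess, not confirmed behavior:

```python
# Allowed transitions between the lifecycle states listed above.
TRANSITIONS = {
    "WAIT_TO_START": {"READY", "FAILED", "CANCELLED"},
    "READY": {"RUNNING", "FAILED", "CANCELLED"},
    "RUNNING": {"SUCCEEDED", "FAILED", "CANCELLED"},
    # Terminal states have no outgoing transitions.
    "SUCCEEDED": set(),
    "FAILED": set(),
    "CANCELLED": set(),
}

def advance(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```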
Key Features
Streaming Protocol: The Herald uses gRPC streaming to efficiently monitor computations:
- Uses a semaphore to limit concurrent computation processing
- Default max concurrency: 5 computations
- Prevents overwhelming the database with parallel writes
- Maintains resumption tokens for streaming
- Enables recovery from network interruptions
- Ensures no computations are missed during reconnection
- Exponential backoff for transient failures
- Maximum retry attempts (default: 5 for streaming, 3 for operations)
- Graceful handling of Kingdom unavailability
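The retry behavior above (exponential backoff with a bounded number of attempts) can be sketched as follows; the base delay, cap, and jitter hook are illustrative assumptions, not the Herald's actual constants:

```python
def backoff_delays(max_attempts: int = 5, base: float = 1.0,
                   cap: float = 60.0, jitter=None):
    """Yield one delay (in seconds) per retry attempt: exponential growth, capped.

    `max_attempts=5` matches the streaming default above; operations would
    use 3. The jitter hook (e.g. lambda d: random.uniform(0, d)) is optional.
    """
    for attempt in range(max_attempts):
        delay = min(cap, base * (2 ** attempt))
        if jitter is not None:
            delay = jitter(delay)
        yield delay
```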
Configuration Flags
Protocols Setup Config
The Herald loads protocol configuration that defines:
- Supported protocols (LLv2, Reach-Only LLv2, HMSS, TrusTee)
- Duchy’s role in each protocol
- Protocol-specific parameters
- Cryptographic keys and certificates
Blob Storage
The Herald needs blob storage access to:
- Store initial requisition data locations
- Manage computation artifact paths
- Configure storage prefixes for this duchy
Private Key Storage
For protocols requiring key encryption (e.g., HMSS), the Herald needs access to private key storage.
Monitoring
Claimed Computations
Rate of new computations claimed from Kingdom
Streaming Reconnects
Frequency of stream interruptions and reconnections
Processing Lag
Time between Kingdom creating computation and Herald claiming it
Error Rate
Failed claim attempts and retry counts
Mill Job Scheduler
Image: duchy/mill-job-scheduler
Deployment Name: {duchy-name}-mill-job-scheduler
Type: Continuous deployment
Purpose
The Mill Job Scheduler monitors the duchy’s Internal API for computations ready to execute and creates Kubernetes Jobs to run the appropriate mill workers for each computation stage.
Responsibilities
Work Polling
Continuously polls for claimable work:
- Queries Internal API for computations in executable states
- Claims work using token-based locking
- Respects work lock durations to prevent duplicate execution
- Polls at configurable intervals (default based on deployment)
Job Creation
Creates Kubernetes Jobs for mill execution:
- Selects appropriate PodTemplate (LLv2, HMSS)
- Generates unique Job name from computation token
- Passes computation details via command-line arguments
- Sets job timeout and retry policies
- Manages job lifecycle (creation, monitoring, cleanup)
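Generating a unique Job name from the computation token has to respect Kubernetes naming rules (lowercase DNS-1123; Jobs are commonly kept to 63 characters because the name is copied into the `job-name` label). This naming scheme is an illustrative assumption, not the scheduler's actual one:

```python
import re

def job_name(duchy: str, computation_id: str, stage: str) -> str:
    """Build a Kubernetes-safe Job name from computation details.

    Lowercases, replaces characters outside [a-z0-9-] with '-', and
    truncates to 63 characters. The {duchy}-mill-{id}-{stage} scheme is
    hypothetical.
    """
    raw = f"{duchy}-mill-{computation_id}-{stage}".lower()
    name = re.sub(r"[^a-z0-9-]", "-", raw)
    return name.strip("-")[:63].rstrip("-")
```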
Concurrency Management
Enforces limits on parallel computations:
- LLv2 maximum concurrency (configurable)
- HMSS maximum concurrency (configurable)
- Prevents resource exhaustion
- Queues work when at capacity
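The per-protocol capping and queueing described above can be sketched like this (illustrative Python; the real scheduler's data structures differ):

```python
from collections import Counter

def schedulable(queue, running, limits):
    """Split queued work into what can start now and what must wait.

    queue: list of (computation_id, protocol) pairs in claim order;
    running: mapping of protocol -> currently running mill jobs;
    limits: mapping of protocol -> configured max concurrency.
    Items over a protocol's cap stay queued for the next polling pass.
    """
    active = Counter(running)
    to_start, still_queued = [], []
    for comp_id, protocol in queue:
        if active[protocol] < limits.get(protocol, 0):
            active[protocol] += 1
            to_start.append(comp_id)
        else:
            still_queued.append(comp_id)
    return to_start, still_queued
```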
Job Cleanup
Removes completed Kubernetes Jobs:
- Deletes successful jobs after completion
- Retains failed jobs for debugging (configurable)
- Prevents Job object accumulation
- Manages Kubernetes API quota
Implementation
The Mill Job Scheduler is implemented in duchy deploy code and uses:
- A Kubernetes client to create/delete Jobs
- An Internal API client to claim work
- PodTemplate references for job definitions
Configuration Flags
Work Lock Duration
The work lock duration determines how long a mill worker has to complete a stage:
- Too short: jobs may not finish before the lock expires, causing duplicate work
- Too long: failed jobs hold locks unnecessarily, delaying retries
Typical values:
- Simple stages: 5 minutes
- Complex stages: 15-30 minutes
- Adjust based on data size and cluster resources
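The lock-sizing rule of thumb (a multiple of the expected stage duration) and the expiry check can be sketched as small helpers; the names and the 2.5x factor are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

def lock_expiry(now: datetime, expected_stage_duration: timedelta,
                factor: float = 2.5) -> datetime:
    """Compute a work-lock expiry as a multiple of the expected stage duration."""
    return now + expected_stage_duration * factor

def lock_is_expired(now: datetime, expiry: datetime) -> bool:
    """A claim whose lock has expired may be re-claimed by another worker."""
    return now >= expiry
```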
PodTemplates
The scheduler references PodTemplates defined in the duchy deployment:
LLv2 Mill Template: {duchy}-llv2-mill
HMSS Mill Template: {duchy}-hmss-mill
These templates define:
- Container image for mill worker
- Resource requests/limits
- Volume mounts (secrets, config)
- Environment variables
- Restart policy (typically “Never” for Jobs)
Kubernetes Permissions
The Mill Job Scheduler requires RBAC permissions:
ServiceAccount: {duchy}-mill-job-scheduler
Role permissions:
- Create Jobs from PodTemplates
- Monitor Job status
- Delete completed Jobs
- Query its own Deployment for configuration
Resource Allocation
The Mill Job Scheduler itself is lightweight; the heavy computation happens in the mill Jobs it creates.
Monitoring
Jobs Created
Rate of mill job creation per protocol
Queue Depth
Number of computations waiting for capacity
Job Success Rate
Percentage of jobs completing successfully
Lock Contention
Frequency of work already locked by another worker
Computations Cleaner
Image: duchy/computations-cleaner
CronJob Name: {duchy-name}-computations-cleaner
Schedule: 0 * * * * (every hour, on the hour)
Purpose
The Computations Cleaner is a CronJob that removes old computation data from the duchy’s Spanner database to:
- Free up database storage
- Maintain query performance
- Remove computations that are no longer needed
- Comply with data retention policies
Operation
Implemented in src/main/kotlin/org/wfanet/measurement/duchy/service/internal/computations/:
File: ComputationsCleaner.kt
Deletion Strategy
The cleaner:
- Queries for computations older than TTL
- Filters by deletable states (if configured)
- Deletes computation records from Spanner
- Optionally removes associated blob storage
- Logs deletion operations for audit
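The selection step above (TTL cutoff plus deletable-state filter, with dry-run logging) can be sketched as follows; field names and the helper itself are illustrative, not the actual ComputationsCleaner code:

```python
from datetime import datetime, timedelta, timezone

DELETABLE_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def select_deletable(computations, ttl: timedelta, now=None, dry_run=False):
    """Return IDs of computations past their TTL and in a deletable state.

    `computations` is a list of dicts with illustrative fields `id`, `state`,
    and `update_time` (timezone-aware datetime). With dry_run=True the IDs
    are only logged, mirroring the dry-run mode described below.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - ttl
    ids = [
        c["id"]
        for c in computations
        if c["state"] in DELETABLE_STATES and c["update_time"] < cutoff
    ]
    if dry_run:
        for cid in ids:
            print(f"[dry-run] would delete computation {cid}")
    return ids
```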
Configuration Flags
Time to Live (TTL)
Default retention: 180 days
Considerations for setting TTL:
- Storage costs: longer retention = higher costs
- Debugging needs: Recent computations useful for troubleshooting
- Compliance: May need to retain for audit purposes
- Coordination: Should align with Kingdom’s completed measurements deletion
Deletable States
The cleaner can be configured via duchy deployment to only delete specific states:
- SUCCEEDED
- FAILED
- CANCELLED
Dry Run Mode
Test deletion policies before enabling:
- Queries for deletable computations
- Logs what would be deleted
- Does not actually delete anything
- Useful for validating TTL settings
Schedule
Runs every hour at minute 0:
- Regular cleanup without excessive database load
- Timely removal of old data
- Manageable batch sizes per run
Network Policy
The cleaner CronJob can only communicate with:
- Internal API Server (to delete computations)
Daemon Deployment Patterns
Common Configuration
All daemons share:
Secrets Access:
- TLS certificates for authentication
- Optional key encryption keys
Network Policy:
- Restricted egress to required services only
- No ingress (daemons initiate all connections)
Observability:
- Health checks
- Optional verbose logging
- Metrics export (when configured)
Reliability
Restart Policies
Herald & Mill Job Scheduler: Always
- Critical daemons that must stay running
- Kubernetes automatically restarts on failure
Computations Cleaner:
- Runs on schedule
- Failures don’t require immediate restart
Graceful Shutdown
Daemons handle SIGTERM:
- Complete current operation
- Close database connections
- Save continuation tokens
- Exit cleanly
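The SIGTERM handling above can be sketched with a flag the main loop checks between operations (illustrative Python; the daemons are Kotlin/JVM processes with their own shutdown hooks):

```python
import signal

class GracefulShutdown:
    """Minimal SIGTERM handling in the spirit of the steps above:
    finish the current operation, then exit the loop cleanly."""

    def __init__(self):
        self.stopping = False
        # Register the handler for SIGTERM (what Kubernetes sends on pod stop).
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Just flip a flag; the main loop checks it between operations,
        # so the in-flight operation completes before shutdown.
        self.stopping = True
```

A main loop would then run `while not shutdown.stopping: process_one()`, followed by closing connections and saving the continuation token.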
Backoff and Retry
Exponential backoff for:
- Kingdom API failures
- Internal API unavailability
- Network errors
- Transient database errors
Troubleshooting
Herald Not Claiming Work
Check Kingdom connectivity.
Mill Jobs Not Starting
Check scheduler logs.
Cleaner Not Deleting
Check CronJob status.
Best Practices
Herald Configuration
- Set appropriate max concurrency based on database capacity
- Use continuation tokens for stream resumption
- Configure deletable states to match retention policy
- Monitor streaming reconnection frequency
Mill Job Scheduler
- Set work lock duration 2-3x expected stage duration
- Configure max concurrency based on cluster resources
- Monitor job success rates and adjust retry policies
- Clean up old jobs to prevent Kubernetes API overload
Computations Cleaner
- Align TTL with Kingdom’s measurement deletion policy
- Test with dry-run before enabling deletion
- Monitor storage savings from cleanup
- Consider blob storage cleanup separately
Next Steps
Mill Protocols
Learn about cryptographic protocols executed by mills
Duchy Services
Understand duchy API services