## System Components

The task management system consists of three separate components:

- Dependency Manager
- Task Manager
- Workflow Manager
## Scheduling Considerations

When choosing a task to run, the system considers:

- Creation time: earlier tasks are prioritized
- Job dependencies: dependent tasks wait for prerequisites
- Capacity: available resources on execution nodes
## Dependency Manager

### Purpose

Responsible for looking at each pending task and determining whether a dependency should be created for that task.

### Example: Update on Launch

If `scm_update_on_launch` is enabled for a project, a project update is created as a dependency whenever a job using that project is launched.
### Dependency Chain

Dependencies can have their own dependencies.

### Dependency Manager Steps
1. Get pending tasks (parent tasks) that have `dependencies_processed = False`
2. Cache related objects as an optimization:
   - Related projects
   - Related inventory sources
3. Create dependencies when needed:
   - Project or inventory update not already created
   - Last update failed
   - Last update outside the cache timeout window
   - Additional logic for inventory updates
4. Link dependencies to the parent task:
   - Use the `dependent_jobs` field
   - Allows canceling the parent if a dependency fails
5. Mark dependencies processed:
   - Update parent tasks with `dependencies_processed = True`
6. Check nested dependencies:
   - Inventory source updates can have project update dependencies
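The steps above can be sketched in miniature. This is a hypothetical, simplified model using plain dataclasses; the real objects are Django models and the names here are illustrative, not the actual AWX API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Task:
    name: str
    dependencies_processed: bool = False
    dependent_jobs: List["Task"] = field(default_factory=list)

@dataclass
class Project:
    name: str
    last_update_failed: bool = False
    cache_expired: bool = False

def needs_update(project: Project) -> bool:
    # Create a dependency if the last update failed or the cached
    # result is outside the cache timeout window.
    return project.last_update_failed or project.cache_expired

def process_dependencies(pending: List[Task], projects: Dict[str, Project]) -> None:
    for task in pending:
        if task.dependencies_processed:
            continue
        project = projects[task.name]
        if needs_update(project):
            update = Task(name=f"update-{project.name}",
                          dependencies_processed=True)
            # Link via dependent_jobs so the parent can be canceled
            # if the dependency fails.
            task.dependent_jobs.append(update)
        # Mark the parent so later cycles skip it.
        task.dependencies_processed = True
```

Linking through `dependent_jobs` is what lets the scheduler cancel the parent task when a dependency fails.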
### Update on Launch Logic

Projects and inventory sources marked as "update on launch" trigger updates when related job templates are launched. Rules:

- An update is triggered when a related job template is launched
- An update is not triggered if:
  - A recent update exists
  - The last update finished successfully
  - The finished time is within the configured cache window
- Failed updates always trigger a new update
- "Update on launch" jobs have a `launch_type` of `dependent`
- If a dependent job fails, related jobs also fail
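The decision rules above can be condensed into a single predicate. A minimal sketch, with an assumed function name and signature (not AWX code):

```python
from datetime import datetime, timedelta

def update_needed(last_status, last_finished, cache_timeout_s, now=None):
    """Return True if a new project/inventory update should be triggered."""
    now = now or datetime.utcnow()
    if last_finished is None:
        return True                      # never updated before
    if last_status != "successful":
        return True                      # failed updates always re-trigger
    # Successful update: skip only if it finished inside the cache window.
    return now - last_finished > timedelta(seconds=cache_timeout_s)
```

A successful update inside the cache window is the only case where the launch proceeds without creating a dependency.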
## Task Manager

### Purpose

Responsible for looking at each pending task and determining whether the Task Manager can start that task.

### Task Manager Steps
1. Get tasks that have `dependencies_processed = True`:
   - Pending tasks
   - Waiting tasks
   - Running tasks
2. Process running tasks first:
   - Build the dependency graph
   - Account for currently consumed capacity
   - Track capacity in memory:
     - `TaskManagerInstances`: instance capacity tracking
     - `TaskManagerInstanceGroups`: group capacity tracking
3. For each pending task:
   - Check whether the total tasks started this cycle exceed `start_task_limit`
   - Check whether the task has timed out
   - Check whether the task is blocked (by dependencies or concurrency rules)
   - Check whether preferred instances have enough capacity
4. Start the task:
   - Change status to `waiting`
   - Submit the task to the dispatcher (via `pg_notify`)
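Step 3's per-task checks amount to a short filtering loop. A simplified sketch with illustrative names (the real checks live inside AWX's Task Manager, not in a standalone function like this):

```python
def process_pending(pending, started_count, start_task_limit,
                    is_blocked, has_capacity):
    """Return the list of tasks this cycle decides to start."""
    to_start = []
    for task in pending:
        if started_count + len(to_start) >= start_task_limit:
            break                       # per-cycle start limit reached
        if is_blocked(task):
            continue                    # dependencies or concurrency rules
        if not has_capacity(task):
            continue                    # preferred instances are full
        to_start.append(task)           # would move to 'waiting' + pg_notify
    return to_start
```

Note that a blocked task is skipped, not dropped: it stays pending and is re-evaluated on the next cycle.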
### Blocking Logic

Hard blocking: database-backed via the `dependent_jobs` field

- Job A will not run if any of its `dependent_jobs` are still running
- Represented in the database

Soft blocking: concurrency rules computed in memory

- No database representation
- Determined via the dependency graph
- Example: Job A and Job B are based on the same template with `allow_simultaneous` disabled; Job B is blocked while Job A is running
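Both blocking checks can be expressed as one predicate. A hypothetical helper (not AWX code) that models tasks as dictionaries:

```python
def is_blocked(task, running, template_of):
    """running: set of running job names; template_of: job -> job template."""
    # Hard blocking: one of the task's dependent_jobs is still running
    # (this relationship is stored in the database).
    if any(dep in running for dep in task.get("dependent_jobs", [])):
        return True
    # Soft blocking: no DB representation -- another running job from the
    # same template while allow_simultaneous is disabled.
    if not task.get("allow_simultaneous", False):
        same_template = {j for j in running
                         if template_of.get(j) == template_of.get(task["name"])}
        if same_template:
            return True
    return False
```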
### Task Manager Rules

- Groups of blocked tasks run in chronological order
- Tasks run when capacity is available (one job is always allowed per instance group)
- Only one Project Update per Project at a time
- Only one Inventory Update per Inventory Source at a time
- Only one Job per Job Template at a time (unless `allow_simultaneous` is enabled)
- Only one System Job at a time
### Node Affinity Decider

The Task Manager decides which exact node a job will run on. Decision process:

- Construct the set of instance groups where the job can run
  - Consider the user-configured group execution policy
  - Consider user-configured capacity
- Traverse the groups to find a suitable node
  - First choice: the node with the largest remaining capacity that can fit the job
  - Fallback: the largest idle node, even if the job exceeds its capacity
  - This allows instances to exceed capacity limits when necessary
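The fit-then-fallback selection can be sketched as follows; this is an illustrative reconstruction, not the actual AWX implementation:

```python
def choose_node(nodes, job_impact):
    """nodes: list of (name, remaining_capacity, is_idle) tuples."""
    # First choice: the node with the most remaining capacity that fits the job.
    fitting = [n for n in nodes if n[1] >= job_impact]
    if fitting:
        return max(fitting, key=lambda n: n[1])[0]
    # Fallback: the largest idle node, even if the job exceeds its capacity.
    idle = [n for n in nodes if n[2]]
    if idle:
        return max(idle, key=lambda n: n[1])[0]
    return None                # no node available this cycle
```

The fallback branch is what lets an oversized job run anyway instead of waiting forever, at the cost of pushing an idle instance past its capacity limit.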
## Workflow Manager

### Purpose

Responsible for looking at each workflow job and determining whether the next node can run.

### Workflow Manager Steps
1. Get all running workflow jobs
2. Build a workflow DAG for each workflow job:
   - Directed Acyclic Graph of workflow nodes
   - Represents the workflow structure
3. For each workflow job:
   - Check if it has timed out
   - Check if the next node can start, based on:
     - Previous node status
     - Success/failure/always logic
     - Convergence rules
4. Create and start new tasks:
   - Create a task for the next workflow node
   - Signal it to start
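The success/failure/always branching in step 3 can be modeled as a lookup over the DAG's labeled edges. A toy sketch with a hypothetical edge structure (convergence rules omitted):

```python
def next_nodes(node, status, edges):
    """edges: {parent: {"success": [...], "failure": [...], "always": [...]}}."""
    # 'always' children run regardless of the parent's outcome.
    out = list(edges.get(node, {}).get("always", []))
    if status == "successful":
        out += edges.get(node, {}).get("success", [])
    elif status in ("failed", "error"):
        out += edges.get(node, {}).get("failure", [])
    return out
```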
Workflow Execution
Workflows execute based on node relationships:System Architecture
Entry Point: schedule()
Each manager has a single entry point:schedule().
Locking mechanism:

- Attempts to acquire a single, global lock in the database
- If the lock cannot be acquired, the method returns
- An existing lock indicates another instance is currently running
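The real lock is a single global database lock; this sketch shows the same acquire-or-return pattern using a process-local lock purely for illustration:

```python
import threading

_schedule_lock = threading.Lock()

def schedule(run_fn):
    # Non-blocking acquire: if another manager already holds the lock,
    # simply return instead of waiting.
    if not _schedule_lock.acquire(blocking=False):
        return False
    try:
        run_fn()                 # the manager's actual scheduling pass
        return True
    finally:
        _schedule_lock.release()
```

Returning immediately on a failed acquire is safe because the holder's run, plus the periodic fallback, will cover any work this call would have done.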
### Atomic Transactions

Each manager runs inside an atomic database transaction:

- If the dispatcher task is killed, there are no partial updates
- All-or-nothing execution
- Consistency guaranteed
### Hybrid Scheduler: Periodic + Event-Triggered

Managers run in two ways:

a) Periodically: as a background task (every 30 seconds by default)
b) Event-triggered: on job creation or completion

The Workflow Manager doesn't run directly on a schedule; it piggybacks off the Task Manager. If the Task Manager sees running workflow jobs, it schedules the Workflow Manager.

Benefits:

- Reduced latency: jobs start faster with event-triggered execution
- Fail-safe: periodic execution catches missed events
- Resilience: the system progresses even if events are missed
### Bulk Reschedule

Utility classes prevent scheduling too many manager runs. For example, `ScheduleTaskManager.schedule()` ensures that only one Task Manager run is scheduled after all tasks are processed, rather than one per task.
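The idea is a simple collapse-many-requests-into-one pattern. A minimal sketch with illustrative names (not the real `ScheduleTaskManager` class):

```python
class ScheduleOnce:
    """Collapse repeated schedule() calls into a single deferred run."""

    def __init__(self, run_fn):
        self._run_fn = run_fn
        self._needed = False

    def schedule(self):
        self._needed = True      # called once per task; no manager run yet

    def flush(self):
        # Called once after all tasks are processed.
        if self._needed:
            self._needed = False
            self._run_fn()
```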
### Timing Out

Because of the global lock, only one manager can run at a time. Timeout protection:

- The parent dispatcher process will SIGKILL a stuck manager
- The timeout triggers after a few minutes
- This allows a new manager to take over
- Because the manager runs in a transaction, SIGKILL rolls back its changes
- The next run re-processes the same tasks
- Risk: a manager never progresses (times out every cycle)
- Solution: the manager checks the time and bails out early if it is near the timeout
  - Commits partial progress before the timeout
  - The next cycle continues from where the previous one left off
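The early-bailout guard can be sketched as a deadline check inside the processing loop; the names and budget value here are illustrative:

```python
import time

def run_with_deadline(tasks, process_one, budget_s=60.0, clock=time.monotonic):
    """Process tasks until done or until the time budget is spent."""
    start = clock()
    done = []
    for task in tasks:
        if clock() - start > budget_s:
            break                # bail out; the next cycle continues from here
        process_one(task)
        done.append(task)
    return done                  # committed work survives even if we bailed
```

Bailing out voluntarily means the transaction commits, so partial progress is kept instead of being rolled back by a SIGKILL.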
## Job Lifecycle Detail

### Status Transitions

### Status Meanings

| Status | State |
|---|---|
| pending | Job launched, but (1) not yet seen by the scheduler, (2) blocked by another task, or (3) not enough capacity |
| waiting | Job submitted to dispatcher via pg_notify |
| running | Job is running on an AWX node |
| successful | Job finished with return code 0 |
| failed | Job finished with return code ≠ 0 |
| error | System failure |
| canceled | Manually canceled by user |
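The transitions implied by the table can be captured in a small lookup. This is inferred from the status meanings above, not the authoritative AWX state machine:

```python
# Allowed forward transitions (inferred sketch).
ALLOWED = {
    "pending": {"waiting", "canceled", "error"},
    "waiting": {"running", "canceled", "error"},
    "running": {"successful", "failed", "canceled", "error"},
}

def can_transition(src, dst):
    """True if a job may move from status src to status dst."""
    return dst in ALLOWED.get(src, set())
```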
## Capacity Calculation

### Instance Capacity

Each instance has:

- Total capacity: configured or calculated from resources
- Consumed capacity: sum of running job impacts
- Remaining capacity: total minus consumed
### Job Impact

Jobs consume capacity based on:

- Forks: higher fork counts mean higher impact
- Job type: some jobs have a fixed impact (e.g., system jobs = 5)
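The capacity arithmetic above fits in a few lines. The fixed impact of 5 for system jobs comes from the text; the `forks + 1` formula for other jobs is an assumption for illustration:

```python
def job_impact(job_type, forks):
    if job_type == "system":
        return 5                  # fixed impact for system jobs (per the text)
    return forks + 1              # assumed: impact scales with configured forks

def remaining_capacity(total, running_jobs):
    """running_jobs: iterable of (job_type, forks) pairs."""
    consumed = sum(job_impact(t, f) for t, f in running_jobs)
    return total - consumed
```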
### Special Capacity Rule

Even when an instance group has no remaining capacity, one job is always allowed to run per instance group, so instances can temporarily exceed their capacity limits.
## Managers Are Short-Lived

Manager instances are ephemeral:

- Created: a new instance is built on each run
- Load data: relevant data is pulled from the database
- Process: the scheduling logic executes
- Cleanup: the instance is destroyed

Benefits:

- No stale state
- Fresh data every cycle
- No memory leaks from long-running processes
## Debugging the Task Manager

### Checking Task Status

### Forcing Task Manager Run

### Checking Capacity
### Common Issues

Jobs stuck in pending:

- Check if dependencies are satisfied
- Check capacity on instance groups
- Check for blocking jobs (concurrent jobs disabled)
- Verify the task manager is running
- Check that the dispatcher is running: `awx-manage dispatcherctl status`
- Check for errors in logs: `/var/log/tower/`
- Verify database connectivity
## Performance Tuning

### `start_task_limit`

Limits the number of tasks started per Task Manager cycle.

### Task Manager Period

Controls how often the Task Manager runs.

### Database Indexes

Ensure indexes exist on:

- The `status` field
- The `dependencies_processed` field
- The `created` timestamp