AWX supports multi-node cluster configurations that enable horizontal scaling, high availability, and increased job execution capacity.

Architecture Overview

AWX can be deployed in a clustered configuration with multiple control plane nodes working together to handle API requests and execute jobs.
       ┌───────────────────────────┐
       │      Load-balancer        │
       │   (configured separately) │
       └───┬───────────────────┬───┘
           │   round robin API │
           ▼       requests    ▼

  AWX Control               AWX Control
    Node 1                    Node 2
┌──────────────┐           ┌──────────────┐
│              │           │              │
│ ┌──────────┐ │           │ ┌──────────┐ │
│ │ awx-task │ │           │ │ awx-task │ │
│ ├──────────┤ │           │ ├──────────┤ │
│ │ awx-ee   │ │           │ │ awx-ee   │ │
│ ├──────────┤ │           │ ├──────────┤ │
│ │ awx-web  │ │           │ │ awx-web  │ │
│ ├──────────┤ │           │ ├──────────┤ │
│ │ redis    │ │           │ │ redis    │ │
│ └──────────┘ │           │ └──────────┘ │
│              │           │              │
└──────────────┴─────┬─────┴──────────────┘
                     │
               ┌─────▼─────┐
               │ Postgres  │
               │ database  │
               └───────────┘

Deployment Types

There are two main deployment types:

Virtual Machines (VM)

Ansible Automation Platform (AAP) can be installed on VMs, where services run as traditional OS-level processes.

Kubernetes (K8S)

Both AAP and upstream AWX support K8S deployments with containerized services.
The upstream AWX project can only be installed via a K8S deployment, while AAP supports both. Either deployment type supports cluster scaling.

Control Node Components

VM Deployments

Control plane nodes run background services managed by supervisord:
  • dispatcher - Job scheduling and task management
  • wsbroadcast - WebSocket communication between nodes
  • callback receiver - Ansible callback processing
  • receptor - Mesh networking (managed under systemd)
  • redis - Caching and message broker (managed under systemd)
  • uwsgi - WSGI application server
  • daphne - ASGI server for WebSockets
  • rsyslog - Logging service

Kubernetes Deployments

Background processes are containerized:

  • awx-ee - receptor
  • awx-web - uwsgi, daphne, wsbroadcast, rsyslog
  • awx-task - dispatcher, callback receiver
  • redis - redis

Monolithic Design

Each control node is monolithic and contains all necessary components for handling API requests and running jobs.
Key Characteristics:
  • Load balancer distributes incoming requests across control nodes
  • All control nodes interact with a single, shared PostgreSQL database
  • If any service fails badly enough, the entire instance is automatically placed offline for remediation

Scaling the Cluster

AAP Deployments

1. Modify Inventory - Edit the Ansible inventory file to include the new nodes.
2. Run Setup Script - Execute setup.sh to provision the new nodes.
3. Verify Registration - Confirm the new control plane node is registered in the database as a new Instance.

Kubernetes Deployments

Scaling is handled by changing the number of replicas on the AWX web and task deployments:
kubectl scale deployment awx-web --replicas=5
kubectl scale deployment awx-task --replicas=5
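With an AWX Operator-based install, the same scaling can be expressed declaratively in the AWX custom resource rather than with kubectl scale, which the operator would otherwise revert. The resource name and replica count below are illustrative; verify the spec fields against your operator version:

```yaml
# AWX custom resource sketch: the operator reconciles the web and
# task deployments to the requested replica count
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  replicas: 5
```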

Instance Types

Nodes can be configured with different types based on their role:
Type        AAP Only  Description
control     No        Control plane node that cannot run jobs
hybrid      Yes       Control plane node that can also run jobs
execution   No        Not a control node; can only run jobs
hop         Yes       Routes traffic from control to execution nodes
Hybrid and control nodes are identical apart from the type recorded in the database: control-type nodes still have all the machinery needed to run jobs, but job execution is disabled through the API. This allows control nodes to be provisioned with fewer hardware resources.

Communication Between Nodes

Connection Matrix

Node Type       Connection Type       Purpose
Control node    websockets, receptor  Sending websockets, heartbeat
Execution       receptor              Submitting jobs, heartbeat
Hop (AAP only)  receptor              Routing traffic to execution nodes
Postgres        postgres TCP/IP       Read and write queries, pg_notify

Receptor

Receptor provides an overlay network connecting control, execution, and hop nodes.
How It Works:
  • Establishes periodic heartbeats between nodes
  • Submits jobs to execution nodes
  • Forms a mesh via persistent TCP/IP connections
  • Routes traffic through intermediate nodes
node A <---TCP---> node B <---TCP---> node C
Node A is reachable from node C (and vice versa) even without a direct connection. Receptor routes traffic through node B.
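A minimal receptor configuration for node B in the sketch above might look like the following. The node IDs, address, and port are illustrative; in practice the mesh configuration is generated by the installer or operator:

```yaml
# receptor.conf on node B: listens for inbound peers (e.g. node C)
# and dials out to node A, forming the B<->A and B<->C links
---
- node:
    id: nodeB
- tcp-listener:
    port: 27199
- tcp-peer:
    address: nodeA.example.com:27199
- control-service:
    service: control
    filename: /var/run/receptor/receptor.sock
```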

WebSocket Backplane

Each control node establishes websocket connections to all other control nodes.
┌────────┐
│        │
│browser │
│        │
└───┬────┘
    │ websocket connection

┌───▼─────┐            ┌─────────┐
│ control │            │ control │
│ node A  │◄───────────┤ node B  │
└─────────┘  websocket └─────────┘
             connection
             (job event)
Purpose:
  • Stream real-time data (job events, logs) to the UI
  • The load balancer determines which control node a browser connects to
  • Control nodes broadcast messages to all other control nodes
  • Ensures users see real-time updates regardless of which node generated them
The websocket backplane is handled by the wsbroadcast service that starts with the application.

PostgreSQL

AWX uses psycopg3 to connect to PostgreSQL:
  • Only control nodes need direct database access
  • Uses pg_notify for inter-process communication
  • Enables dispatcher system to coordinate parallel processes
  • Task manager communicates with main dispatcher thread via notifications

Node Health Management

Node health is determined by the cluster_node_heartbeat periodic task running on each control node.

Heartbeat Process

1. Get Instance List - Retrieve all instances registered in the database.
2. Inspect Execution Nodes
  • Acquire a DB advisory lock (a single control node inspects at a time)
  • Set last_seen based on the Receptor heartbeat
  • Gather node info via receptorctl status
  • Run execution_node_health_check
  • Execute ansible-runner --worker-info to get CPU, memory, and version
  • Calculate capacity for the instance
3. Detect Lost Nodes
  • Calculate the grace period: CLUSTER_NODE_HEARTBEAT_PERIOD * CLUSTER_NODE_MISSED_HEARTBEAT_TOLERANCE
  • Mark instances as lost if last_seen exceeds the grace period
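As a concrete illustration of the grace-period arithmetic (the setting values below are illustrative, not necessarily AWX defaults):

```shell
# Grace period = heartbeat period * missed-heartbeat tolerance
CLUSTER_NODE_HEARTBEAT_PERIOD=60          # seconds (illustrative value)
CLUSTER_NODE_MISSED_HEARTBEAT_TOLERANCE=3 # illustrative value
GRACE_PERIOD=$((CLUSTER_NODE_HEARTBEAT_PERIOD * CLUSTER_NODE_MISSED_HEARTBEAT_TOLERANCE))
# A node whose last_seen is older than this many seconds is marked lost
echo "$GRACE_PERIOD"   # → 180
```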
4. Check Local Health
  • Determine whether the current node is lost
  • Call get_cpu_count and get_mem_in_bytes from ansible-runner
5. Register New Instance - If the current instance is not found in the database, register it.
6. Version Comparison
  • Compare the current node’s AWX version with the other instances
  • If this node is older, call stop_local_services and shut down
7. Handle Lost Instances
  • Reap running, pending, and waiting jobs (mark them as failed)
  • Delete the instance from the database
8. Reap Local Jobs - Clean up jobs not actively processed by dispatcher workers.

Instance Groups

Instances can be organized into Instance Groups for workload management and resource allocation.

Creating Instance Groups

System Administrators can create Instance Groups:
curl -X POST https://awx.example.com/api/v2/instance_groups/ \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production Workers",
    "policy_instance_percentage": 50
  }'

Associating Instances

Add instances to groups:
curl -X POST https://awx.example.com/api/v2/instance_groups/x/instances/ \
  -H "Content-Type: application/json" \
  -d '{"id": y}'
Instances automatically reconfigure to listen on the group’s work queue when added.

Instance Group Policies

Policies determine automatic instance assignment to groups:

Policy Fields

  • policy_instance_percentage - Percentage (0-100) of active instances to assign to this group
  • policy_instance_minimum - Minimum number of instances to maintain in the group
  • policy_instance_list - Fixed list of instance names to always include

Policy Behavior

Percentage + Minimum Work Together: If you have 50% percentage and minimum of 2:
  • With 6 instances → 3 assigned to group
  • With 2 instances → 2 assigned (meets minimum)
  • With 1 instance → 1 assigned (can’t meet minimum)
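The percentage-plus-minimum interplay above can be sketched as a small calculation. This is a toy model: the round-up behavior is an assumption for illustration, not confirmed AWX logic.

```shell
# Toy model: assigned = max(ceil(total * pct / 100), minimum), capped at total
assign() {
  local total=$1 pct=$2 min=$3
  local by_pct=$(( (total * pct + 99) / 100 ))  # integer ceil (assumption)
  local n=$(( by_pct > min ? by_pct : min ))    # the minimum wins if larger
  echo $(( n > total ? total : n ))             # cannot exceed total instances
}
assign 6 50 2   # → 3
assign 2 50 2   # → 2
assign 1 50 2   # → 1
```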
Preventing Overlap: Make percentages sum to 100 across groups:
  • 4 instance groups with 25% each
  • Instances distributed with no overlap

Manually Pinning Instances

To exclusively assign an instance to specific groups:
# Add to policy list
curl -X PATCH https://awx.example.com/api/v2/instance_groups/N/ \
  -d '{
    "policy_instance_list": ["special-instance"]
  }'

# Disable policy management
curl -X PATCH https://awx.example.com/api/v2/instances/X/ \
  -d '{
    "managed_by_policy": false
  }'
Instances with managed_by_policy: false will only belong to groups in their policy_instance_list.

Job Runtime Behavior

When a job is submitted:
  1. Pushed into dispatcher queue via postgres notify/listen
  2. Handled by dispatcher process on a specific AWX node
  3. If instance fails during job execution, work is marked as permanently failed

Instance Group Job Assignment

If cluster has separate Instance Groups:
  • Any instance in the group can receive jobs
  • Capacity reduced from all groups an instance belongs to
  • Provisioning instances expands work capacity
  • De-provisioning removes capacity
If all instances in an Instance Group are offline, jobs targeting only that group will wait until instances become available.

Controlling Job Placement

Default Behavior

Jobs are submitted to:
  • Default queue: For regular jobs (see DEFAULT_EXECUTION_QUEUE_NAME)
  • Control plane queue: For administrative actions like project updates (see DEFAULT_CONTROL_PLANE_QUEUE_NAME)

Restricting Job Placement

Instance Groups can be associated with:
  1. Job Template (highest priority)
  2. Inventory (medium priority)
  3. Organization (lowest priority, via Inventory)
If all associated instance groups are at capacity, jobs remain in pending state until capacity frees up.

Preferred Instance Group Order

AWX checks in this order:
  1. Job Template instance groups
  2. Inventory instance groups (if template groups at capacity)
  3. Organization instance groups (if inventory groups at capacity)
The global instance group can be associated alongside custom groups as a fallback.
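The fallback order can be illustrated with a toy sketch; the group names and the has_capacity check are hypothetical stand-ins for AWX’s real capacity accounting:

```shell
# Check instance groups in priority order; dispatch to the first with capacity
has_capacity() {
  # stand-in: pretend only the organization group has free capacity
  [ "$1" = "org-group" ]
}

target=""
for grp in template-group inventory-group org-group; do
  if has_capacity "$grp"; then
    target=$grp
    break
  fi
done
echo "dispatch to $target"   # → dispatch to org-group
```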

Project Synchronization

Project syncs run on the instance that prepares the ansible-runner private data directory.
Sync Behavior:
  • Performed by dispatcher control/launch process
  • Updates source tree to correct version immediately before job transmission
  • Skipped if correct revision already checked out and no Galaxy/Collections updates needed
  • Recorded as project update with launch_type: sync and job_type: run
  • Does not change project status or version (except for “never updated” projects)
  • Runs with container isolation, volume mounts to persistent projects folder

Instance Enable/Disable

Temporarily take instances offline:
curl -X PATCH https://awx.example.com/api/v2/instances/X/ \
  -d '{"enabled": false}'
When disabled:
  • No new jobs assigned to the instance
  • Existing jobs finish normally
  • Useful for maintenance without terminating running jobs

Status and Monitoring

Cluster Health Endpoint

curl https://awx.example.com/api/v2/ping/
Returns:
  • Instance servicing the HTTP request
  • Last heartbeat time of all other instances
  • Instance Groups and membership

Detailed Views

  • /api/v2/instances/ - View instance details and running jobs
  • /api/v2/instance_groups/ - View groups and membership

Best Practices

  • Load Balancer - Configure proper health checks and session affinity for WebSocket connections
  • Database Performance - Use a dedicated PostgreSQL instance with appropriate resources and tuning
  • Network Reliability - Ensure stable, low-latency connections between cluster nodes
  • Capacity Planning - Monitor capacity and scale before reaching limits
  • Backup Strategy - Regular database backups are critical in clustered environments
  • Version Consistency - Keep all nodes on the same AWX version to prevent automatic shutdowns
