AWX supports multi-node cluster configurations that enable horizontal scaling, high availability, and increased job execution capacity.

Architecture Overview

AWX can be deployed in a clustered configuration with multiple control plane nodes working together to handle API requests and execute jobs.
       ┌───────────────────────────┐
       │      Load-balancer        │
       │   (configured separately) │
       └───┬───────────────────┬───┘
           │   round robin API │
           ▼       requests    ▼

  AWX Control               AWX Control
    Node 1                    Node 2
┌──────────────┐           ┌──────────────┐
│              │           │              │
│ ┌──────────┐ │           │ ┌──────────┐ │
│ │ awx-task │ │           │ │ awx-task │ │
│ ├──────────┤ │           │ ├──────────┤ │
│ │ awx-ee   │ │           │ │ awx-ee   │ │
│ ├──────────┤ │           │ ├──────────┤ │
│ │ awx-web  │ │           │ │ awx-web  │ │
│ ├──────────┤ │           │ ├──────────┤ │
│ │ redis    │ │           │ │ redis    │ │
│ └──────────┘ │           │ └──────────┘ │
│              │           │              │
└──────────────┴─────┬─────┴──────────────┘
                     │
               ┌─────▼─────┐
               │ Postgres  │
               │ database  │
               └───────────┘

Deployment Types

There are two main deployment types:

Virtual Machines (VM)

Ansible Automation Platform (AAP) can be installed on VMs, where services run as traditional OS-level processes.

Kubernetes (K8S)

Both AAP and upstream AWX support K8S deployments with containerized services.
The upstream AWX project can only be installed via a K8S deployment, while AAP supports both. Either deployment type supports cluster scaling.

Control Node Components

VM Deployments

Control plane nodes run background services managed by supervisord:
  • dispatcher - Job scheduling and task management
  • wsbroadcast - WebSocket communication between nodes
  • callback receiver - Ansible callback processing
  • receptor - Mesh networking (managed under systemd)
  • redis - Caching and message broker (managed under systemd)
  • uwsgi - WSGI application server
  • daphne - ASGI server for WebSockets
  • rsyslog - Logging service

Kubernetes Deployments

Background processes are containerized:

  • awx-ee - receptor
  • awx-web - uwsgi, daphne, wsbroadcast, rsyslog
  • awx-task - dispatcher, callback receiver
  • redis - redis

Monolithic Design

Each control node is monolithic and contains all necessary components for handling API requests and running jobs.
Key Characteristics:
  • Load balancer distributes incoming requests across control nodes
  • All control nodes interact with a single, shared PostgreSQL database
  • If any service fails badly enough, the entire instance is automatically placed offline for remediation

Scaling the Cluster

AAP Deployments

1. Modify Inventory - Edit the Ansible inventory file to include the new nodes.
2. Run Setup Script - Execute setup.sh to provision the new nodes.
3. Verify Registration - Confirm the new control plane node is registered in the database as a new Instance.

Kubernetes Deployments

Scaling is handled by changing the number of replicas on the AWX web and task deployments:
kubectl scale deployment awx-web --replicas=5
kubectl scale deployment awx-task --replicas=5
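With an AWX Operator-based install, the same scaling can be expressed declaratively in the AWX custom resource rather than with kubectl scale, which the operator would otherwise revert. The resource name and replica count below are illustrative; verify the spec fields against your operator version:

```yaml
# AWX custom resource sketch: the operator reconciles the web and
# task deployments to the requested replica count
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  replicas: 5
```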

Instance Types

Nodes can be configured with different types based on their role:
Type        AAP Only  Description
control     No        Control plane node that cannot run jobs
hybrid      Yes       Control plane node that can also run jobs
execution   No        Not a control node; can only run jobs
hop         Yes       Routes traffic from control to execution nodes
Hybrid and control nodes are identical apart from the type recorded in the database: control-type nodes still have all the machinery needed to run jobs, but job execution is disabled through the API. This allows control nodes to be provisioned with fewer hardware resources.

Communication Between Nodes

Connection Matrix

Node Type       Connection Type       Purpose
Control node    websockets, receptor  Sending websockets, heartbeat
Execution       receptor              Submitting jobs, heartbeat
Hop (AAP only)  receptor              Routing traffic to execution nodes
Postgres        postgres TCP/IP       Read and write queries, pg_notify

Receptor

Receptor provides an overlay network connecting control, execution, and hop nodes.
How It Works:
  • Establishes periodic heartbeats between nodes
  • Submits jobs to execution nodes
  • Forms a mesh via persistent TCP/IP connections
  • Routes traffic through intermediate nodes
node A <---TCP---> node B <---TCP---> node C
Node A is reachable from node C (and vice versa) even without a direct connection. Receptor routes traffic through node B.
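A minimal receptor configuration for node B in the sketch above might look like the following. The node IDs, address, and port are illustrative; in practice the mesh configuration is generated by the installer or operator:

```yaml
# receptor.conf on node B: listens for inbound peers (e.g. node C)
# and dials out to node A, forming the B<->A and B<->C links
---
- node:
    id: nodeB
- tcp-listener:
    port: 27199
- tcp-peer:
    address: nodeA.example.com:27199
- control-service:
    service: control
    filename: /var/run/receptor/receptor.sock
```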

WebSocket Backplane

Each control node establishes websocket connections to all other control nodes.
┌────────┐
│        │
│browser │
│        │
└───┬────┘
    │ websocket connection

┌───▼─────┐            ┌─────────┐
│ control │            │ control │
│ node A  │◄───────────┤ node B  │
└─────────┘  websocket └─────────┘
             connection
             (job event)
Purpose:
  • Stream real-time data (job events, logs) to the UI
  • The load balancer determines which control node a browser connects to
  • Control nodes broadcast messages to all other control nodes
  • Ensures users see real-time updates regardless of which node generated them
The websocket backplane is handled by the wsbroadcast service that starts with the application.

PostgreSQL

AWX uses psycopg3 to connect to PostgreSQL:
  • Only control nodes need direct database access
  • Uses pg_notify for inter-process communication
  • Enables dispatcher system to coordinate parallel processes
  • Task manager communicates with main dispatcher thread via notifications

Node Health Management

Node health is determined by the cluster_node_heartbeat periodic task running on each control node.

Heartbeat Process

1. Get Instance List - Retrieve all instances registered in the database.
2. Inspect Execution Nodes
  • Acquire a DB advisory lock (a single control node inspects at a time)
  • Set last_seen based on the Receptor heartbeat
  • Gather node info via receptorctl status
  • Run execution_node_health_check
  • Execute ansible-runner --worker-info to get CPU, memory, and version
  • Calculate capacity for the instance
3. Detect Lost Nodes
  • Calculate the grace period: CLUSTER_NODE_HEARTBEAT_PERIOD * CLUSTER_NODE_MISSED_HEARTBEAT_TOLERANCE
  • Mark instances as lost if last_seen exceeds the grace period
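As a concrete illustration of the grace-period arithmetic (the setting values below are illustrative, not necessarily AWX defaults):

```shell
# Grace period = heartbeat period * missed-heartbeat tolerance
CLUSTER_NODE_HEARTBEAT_PERIOD=60          # seconds (illustrative value)
CLUSTER_NODE_MISSED_HEARTBEAT_TOLERANCE=3 # illustrative value
GRACE_PERIOD=$((CLUSTER_NODE_HEARTBEAT_PERIOD * CLUSTER_NODE_MISSED_HEARTBEAT_TOLERANCE))
# A node whose last_seen is older than this many seconds is marked lost
echo "$GRACE_PERIOD"   # → 180
```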
4. Check Local Health
  • Determine whether the current node is lost
  • Call get_cpu_count and get_mem_in_bytes from ansible-runner
5. Register New Instance - If the current instance is not found in the database, register it.
6. Version Comparison
  • Compare the current node’s AWX version with the other instances
  • If this node is older, call stop_local_services and shut down
7. Handle Lost Instances
  • Reap running, pending, and waiting jobs (mark them as failed)
  • Delete the instance from the database
8. Reap Local Jobs - Clean up jobs not actively processed by dispatcher workers.

Instance Groups

Instances can be organized into Instance Groups for workload management and resource allocation.

Creating Instance Groups

System Administrators can create Instance Groups:
curl -X POST https://awx.example.com/api/v2/instance_groups/ \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production Workers",
    "policy_instance_percentage": 50
  }'

Associating Instances

Add instances to groups:
curl -X POST https://awx.example.com/api/v2/instance_groups/x/instances/ \
  -H "Content-Type: application/json" \
  -d '{"id": y}'
Instances automatically reconfigure to listen on the group’s work queue when added.

Instance Group Policies

Policies determine automatic instance assignment to groups:

Policy Fields

  • policy_instance_percentage - Percentage (0-100) of active instances to assign to this group
  • policy_instance_minimum - Minimum number of instances to maintain in the group
  • policy_instance_list - Fixed list of instance names to always include

Policy Behavior

Percentage + Minimum Work Together: If you have 50% percentage and minimum of 2:
  • With 6 instances → 3 assigned to group
  • With 2 instances → 2 assigned (meets minimum)
  • With 1 instance → 1 assigned (can’t meet minimum)
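The percentage-plus-minimum interplay above can be sketched as a small calculation. This is a toy model: the round-up behavior is an assumption for illustration, not confirmed AWX logic.

```shell
# Toy model: assigned = max(ceil(total * pct / 100), minimum), capped at total
assign() {
  local total=$1 pct=$2 min=$3
  local by_pct=$(( (total * pct + 99) / 100 ))  # integer ceil (assumption)
  local n=$(( by_pct > min ? by_pct : min ))    # the minimum wins if larger
  echo $(( n > total ? total : n ))             # cannot exceed total instances
}
assign 6 50 2   # → 3
assign 2 50 2   # → 2
assign 1 50 2   # → 1
```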
Preventing Overlap: Make percentages sum to 100 across groups:
  • 4 instance groups with 25% each
  • Instances distributed with no overlap

Manually Pinning Instances

To exclusively assign an instance to specific groups:
# Add to policy list
curl -X PATCH https://awx.example.com/api/v2/instance_groups/N/ \
  -d '{
    "policy_instance_list": ["special-instance"]
  }'

# Disable policy management
curl -X PATCH https://awx.example.com/api/v2/instances/X/ \
  -d '{
    "managed_by_policy": false
  }'
Instances with managed_by_policy: false will only belong to groups in their policy_instance_list.

Job Runtime Behavior

When a job is submitted:
  1. Pushed into dispatcher queue via postgres notify/listen
  2. Handled by dispatcher process on a specific AWX node
  3. If instance fails during job execution, work is marked as permanently failed

Instance Group Job Assignment

If cluster has separate Instance Groups:
  • Any instance in the group can receive jobs
  • Capacity reduced from all groups an instance belongs to
  • Provisioning instances expands work capacity
  • De-provisioning removes capacity
If all instances in an Instance Group are offline, jobs targeting only that group will wait until instances become available.

Controlling Job Placement

Default Behavior

Jobs are submitted to:
  • Default queue: For regular jobs (see DEFAULT_EXECUTION_QUEUE_NAME)
  • Control plane queue: For administrative actions like project updates (see DEFAULT_CONTROL_PLANE_QUEUE_NAME)

Restricting Job Placement

Instance Groups can be associated with:
  1. Job Template (highest priority)
  2. Inventory (medium priority)
  3. Organization (lowest priority, via Inventory)
If all associated instance groups are at capacity, jobs remain in pending state until capacity frees up.

Preferred Instance Group Order

AWX checks in this order:
  1. Job Template instance groups
  2. Inventory instance groups (if template groups at capacity)
  3. Organization instance groups (if inventory groups at capacity)
The global instance group can be associated alongside custom groups as a fallback.
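The fallback order can be illustrated with a toy sketch; the group names and the has_capacity check are hypothetical stand-ins for AWX’s real capacity accounting:

```shell
# Check instance groups in priority order; dispatch to the first with capacity
has_capacity() {
  # stand-in: pretend only the organization group has free capacity
  [ "$1" = "org-group" ]
}

target=""
for grp in template-group inventory-group org-group; do
  if has_capacity "$grp"; then
    target=$grp
    break
  fi
done
echo "dispatch to $target"   # → dispatch to org-group
```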

Project Synchronization

Project syncs run on the instance that prepares the ansible-runner private data directory.
Sync Behavior:
  • Performed by dispatcher control/launch process
  • Updates source tree to correct version immediately before job transmission
  • Skipped if correct revision already checked out and no Galaxy/Collections updates needed
  • Recorded as project update with launch_type: sync and job_type: run
  • Does not change project status or version (except for “never updated” projects)
  • Runs with container isolation, volume mounts to persistent projects folder

Instance Enable/Disable

Temporarily take instances offline:
curl -X PATCH https://awx.example.com/api/v2/instances/X/ \
  -d '{"enabled": false}'
When disabled:
  • No new jobs assigned to the instance
  • Existing jobs finish normally
  • Useful for maintenance without terminating running jobs

Status and Monitoring

Cluster Health Endpoint

curl https://awx.example.com/api/v2/ping/
Returns:
  • Instance servicing the HTTP request
  • Last heartbeat time of all other instances
  • Instance Groups and membership

Detailed Views

  • /api/v2/instances/ - View instance details and running jobs
  • /api/v2/instance_groups/ - View groups and membership

Best Practices

  • Load Balancer - Configure proper health checks and session affinity for WebSocket connections
  • Database Performance - Use a dedicated PostgreSQL instance with appropriate resources and tuning
  • Network Reliability - Ensure stable, low-latency connections between cluster nodes
  • Capacity Planning - Monitor capacity and scale before reaching limits
  • Backup Strategy - Regular database backups are critical in clustered environments
  • Version Consistency - Keep all nodes on the same AWX version to prevent automatic shutdowns
