This guide documents best practices for managing Ceph distributed storage systems, with a focus on live network migration procedures that maintain cluster availability and data integrity.
The procedures documented here were performed on production Proxmox-backed Ceph clusters with zero downtime and zero data loss.

Ceph Public Network Migration

This procedure documents how to migrate Ceph traffic from a congested management network to a dedicated Ceph fabric while maintaining full cluster availability.

Migration Overview

Goal: Move all Ceph traffic (MON, MGR, MDS, OSD front + back) to a dedicated network fabric.
Example: 172.16.0.0/16 → 10.50.0.0/24
Outcome: No service downtime, no data loss

Key Concepts

public_network

  • Client ↔ OSD traffic
  • MON / MGR control plane
  • CephFS metadata traffic

cluster_network

  • OSD ↔ OSD replication & recovery
  • Data plane for cluster operations

Important Behaviors

  • MON & MGR enforce address validation
  • OSDs bind addresses at restart
  • /etc/pve/ceph.conf is not authoritative alone — Ceph uses its internal config database

Migration Procedure

Step 1: Prepare the New Ceph Network

Create a dedicated bridge on each node:
vim /etc/network/interfaces
# Ceph (Fabric)
auto vmbr-ceph
iface vmbr-ceph inet static
    address 10.50.0.20/24
    bridge-ports eno2
    bridge-stp off
    bridge-fd 0
Assign IPs on the new subnet:
  • pve2 → 10.50.0.20/24
  • pve3 → 10.50.0.30/24
  • pve4 → 10.50.0.40/24
This network is isolated — no gateway required.
Verify connectivity:
ping 10.50.0.30
iperf3 -s  # On one node
iperf3 -c 10.50.0.30  # From another node

Step 2: Add the New Public Network (Dual-Network Phase)

Backup first: cp /etc/pve/ceph.conf /etc/pve/ceph.conf.bak
Edit /etc/pve/ceph.conf:
public_network = 10.50.0.0/24, 172.16.0.0/16
cluster_network = 10.50.0.0/24, 172.16.0.0/16
Do NOT remove the old network yet — this allows gradual migration.
Confirm:
ceph config dump
Verify in Proxmox UI → Ceph → Nodes
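As a quick sanity check, both subnets should be visible in the config during this phase. A minimal sketch of that check, using a heredoc copy of the lines above in place of the real /etc/pve/ceph.conf (on a node you would grep the file itself, or the `ceph config dump` output):

```shell
# Count lines that reference the new subnet in the dual-network config.
# The heredoc stands in for /etc/pve/ceph.conf on a real node.
n=$(grep -c '10\.50\.0\.0/24' <<'EOF'
public_network = 10.50.0.0/24, 172.16.0.0/16
cluster_network = 10.50.0.0/24, 172.16.0.0/16
EOF
)
echo "$n lines reference the new subnet"   # → 2 lines reference the new subnet
```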

Step 3: Recreate MONs (One by One)

1. Destroy the first MON:
pveceph mon destroy pve2
2. Recreate the MON (run on the node being recreated):
pveceph mon create
3. Verify quorum:
ceph -s
4. Repeat for each remaining MON, one at a time.
Always ensure quorum after each MON recreation before proceeding to the next.
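The quorum check between recreations can be scripted by parsing the `mon:` line of `ceph -s`. `quorum_size` is a hypothetical helper, and the sample assumes the usual output format shown in the comment:

```shell
# Sketch: extract the quorum size from the `mon:` line of `ceph -s`.
# Assumed format: "mon: 3 daemons, quorum pve2,pve3,pve4 (age 5m)".
quorum_size() {
  echo "$1" | sed -n 's/.*quorum \([^ ]*\).*/\1/p' | tr ',' '\n' | grep -c .
}

sample="mon: 3 daemons, quorum pve2,pve3,pve4 (age 5m)"
quorum_size "$sample"   # → 3
```

On a live node this would be fed the real output, e.g. `quorum_size "$(ceph -s | grep 'mon:')"`, pausing the migration whenever the count drops below the expected MON count.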

Step 4: Recreate MGRs (One by One)

Recreate standby managers first, then the active manager last.
pveceph mgr destroy <node>
pveceph mgr create
Verify:
ceph mgr dump
If a manager fails to start:
systemctl reset-failed ceph-mgr@<node>
systemctl start ceph-mgr@<node>

Step 5: Recreate CephFS Metadata Servers (MDS)

The MDS binds its address at creation time, so each MDS must be destroyed and recreated:
pveceph mds destroy <node>
pveceph mds create
✔ Verify CephFS health before proceeding.

Step 6: Remove the Old Public Network

Edit /etc/pve/ceph.conf and remove the old network:
public_network = 10.50.0.0/24
cluster_network = 10.50.0.0/24

Step 7: Recreate MONs, MGRs, and MDS (Again)

This ensures all control-plane daemons bind exclusively to the new network.
1. Recreate MONs one by one, verifying quorum after each.
2. Recreate MGRs, standbys first, active last.
3. Recreate MDS daemons one by one, verifying CephFS health.

Step 8: Protect the Cluster Before Touching OSDs

ceph osd set noout
This prevents Ceph from marking OSDs as out during restart.

Step 9: Restart OSDs (Data Plane Migration)

Restart one OSD at a time:
systemctl restart ceph-osd@<id>
ceph -s
Wait for PGs to return to active+clean before proceeding.
Restart OSDs gradually to avoid overwhelming the cluster with recovery traffic.
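One way to script the wait between OSD restarts is to parse the one-line `ceph pg stat` summary. `pg_stat_clean` is a hypothetical helper, and the summary format is assumed from typical output:

```shell
# Sketch: succeed only when every PG in a `ceph pg stat` summary line is
# active+clean (assumed format: "129 pgs: 129 active+clean; 450 GiB data, ...").
pg_stat_clean() {
  total=$(echo "$1" | sed -n 's/^\([0-9][0-9]*\) pgs:.*/\1/p')
  clean=$(echo "$1" | sed -n 's/.* \([0-9][0-9]*\) active+clean[;,].*/\1/p')
  [ -n "$total" ] && [ "$total" = "$clean" ]
}

pg_stat_clean "129 pgs: 129 active+clean; 450 GiB data" \
  && echo "safe to restart the next OSD"
```

On a live cluster the loop would look like `until pg_stat_clean "$(ceph pg stat)"; do sleep 10; done` between restarts.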

Step 10: Remove Protection

ceph osd unset noout

Verification (Critical)

1. Verify Ceph Daemon Addresses

ceph osd metadata <id> | egrep 'front_addr|back_addr'
Expected:
  • front_addr → 10.50.0.x
  • back_addr → 10.50.0.x
  • ❌ No 172.16.x.x
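This check can be automated with a small classifier over the reported addresses. `addr_ok` is a hypothetical helper; feed it the front_addr/back_addr values from `ceph osd metadata`:

```shell
# Sketch: classify a daemon address by subnet. Anything outside the new
# 10.50.0.0/24 fabric (e.g. a leftover 172.16.x.x address) is a problem.
addr_ok() {
  case "$1" in
    10.50.0.*) return 0 ;;  # new Ceph fabric — expected
    *)         return 1 ;;  # old network or anything unexpected
  esac
}

addr_ok "10.50.0.20:6800" && echo "address on the new fabric"
```

Driving it over every OSD would look like `for id in $(ceph osd ls); do ceph osd metadata "$id" | egrep 'front_addr|back_addr'; done`, piping each address through the check.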

2. Verify Traffic is Using the Ceph Fabric

While Ceph is under load:
ip -s link show vmbr-ceph
RX/TX counters should increase, confirming traffic is not using the management network.
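The same counters can be read directly from sysfs, which is easy to script. A minimal sketch, demonstrated on `lo` since that interface exists on any Linux host:

```shell
# Sketch: read an interface's RX byte counter from sysfs. Sampling it twice
# on vmbr-ceph while Ceph is under load shows whether the fabric carries
# the traffic.
rx_bytes() { cat "/sys/class/net/$1/statistics/rx_bytes"; }

rx_bytes lo   # prints the loopback RX byte count

# On a Ceph node:
#   a=$(rx_bytes vmbr-ceph); sleep 5; b=$(rx_bytes vmbr-ceph)
#   echo "RX delta: $((b - a)) bytes over 5s"
```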

3. Verify Network Performance (iperf3)

iperf3 must be installed on all Ceph nodes
apt install iperf3
Server on one node:
iperf3 -s
Client on a different node:
iperf3 -c <peer_ip> -P 4
Expected for 2.5 GbE Ceph fabric:
  • ~2.1–2.4 Gbit/s
  • Minimal or zero retransmits
  • Stable throughput across multiple streams

Troubleshooting

"OSDs Not Reachable / Wrong Subnet"

Symptom:
osd.X's public address is not in '172.16.x.x/16' subnet
Cause: Ceph config DB or MON/MGR cache still references the old network. Fix:
1. Restart ALL MONs (mandatory):
systemctl restart ceph-mon@pve2
systemctl restart ceph-mon@pve3
systemctl restart ceph-mon@pve4
2. Restart ALL MGRs (mandatory):
systemctl restart ceph-mgr@pve2
systemctl restart ceph-mgr@pve3
systemctl restart ceph-mgr@pve4
3. Clean the config DB (optional):
ceph config rm global public_network
ceph config rm global cluster_network
ceph config set global public_network 10.50.0.0/24
ceph config set global cluster_network 10.50.0.0/24
4. Restart OSDs again, one by one, verifying PG status after each.

Risks Considered

Why This Change is Risky

Changing Ceph cluster networking affects quorum, OSD availability, replication traffic, and client IO. Incorrect sequencing can cause data unavailability or permanent loss.

Failure Modes Considered

  • MON quorum loss
  • OSD flapping
  • Client IO stalls
  • Backfill storms
  • Split-brain conditions

Assumptions

  • Single Ceph cluster
  • Dedicated replication network (fabric)
  • Change executed during low IO window

Final State

  • Dedicated fabric: 2.5 GbE Ceph-only network
  • Clean separation: no Ceph traffic on the management NIC
  • Full migration: MON / MGR / MDS / OSD all migrated
  • No data loss: stable cluster, zero downtime

Acknowledgements

This migration approach was heavily informed by the Proxmox forum discussion on Ceph network changes, which provided critical guidance on:
  • Temporarily running dual public networks
  • Recreating MON, MGR, and MDS daemons to force address rebinding
  • Avoiding full cluster downtime during network migration
Source: Proxmox Forum – Ceph: changing public network
Always test network migration procedures in a non-production environment first, and schedule changes during maintenance windows.
