Ceph Public Network Migration

The procedures documented here were performed on production Proxmox-backed Ceph clusters with zero downtime and zero data loss.

This procedure documents how to migrate Ceph traffic from a congested management network to a dedicated Ceph fabric while maintaining full cluster availability.

Migration Overview
Goal: Move all Ceph traffic (MON, MGR, MDS, OSD front + back) to a dedicated network fabric. Example: 172.16.0.0/16 → 10.50.0.0/24
Outcome: No service downtime, no data loss
Key Concepts
public_network
- Client ↔ OSD traffic
- MON / MGR control plane
- CephFS metadata traffic
cluster_network
- OSD ↔ OSD replication & recovery
- Data plane for cluster operations
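These two roles map to two settings in /etc/pve/ceph.conf. A sketch using the subnets from this migration; since the dedicated fabric carries both OSD front and back traffic here, both settings point at the same subnet (treat the exact values as an example, not a prescription):

```ini
[global]
    # Client, MON/MGR, and CephFS metadata traffic
    public_network  = 10.50.0.0/24
    # OSD-to-OSD replication and recovery traffic
    cluster_network = 10.50.0.0/24
```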
Important Behaviors

- MON addresses are recorded in the monmap at creation time; editing public_network does not move an existing MON, so each MON must be destroyed and recreated.
- MGR and MDS daemons bind their address when they start, so they must be recreated (or restarted) to pick up the new network.
- OSDs read public_network and cluster_network at startup; a controlled rolling restart migrates them.
Migration Procedure
Step 1: Prepare the New Ceph Network
Create a dedicated bridge on each node:

- pve2 → 10.50.0.20/24
- pve3 → 10.50.0.30/24
- pve4 → 10.50.0.40/24
This network is isolated — no gateway required.
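On a Proxmox node the bridge can be defined in /etc/network/interfaces. A sketch for pve2; the bridge name vmbr1 and physical port enp3s0 are assumptions, substitute your own NIC:

```
auto vmbr1
iface vmbr1 inet static
    address 10.50.0.20/24
    bridge-ports enp3s0
    bridge-stp off
    bridge-fd 0
# Isolated Ceph fabric: intentionally no gateway line
```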
Step 2: Add the New Public Network (Dual-Network Phase)
Edit /etc/pve/ceph.conf:
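Ceph accepts a comma-separated list of public networks, so the dual-network phase can list both subnets, a sketch:

```ini
[global]
    # New fabric first, old management network second
    public_network = 10.50.0.0/24, 172.16.0.0/16
```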
Do NOT remove the old network yet — this allows gradual migration.
Step 3: Recreate MONs (One by One)
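On Proxmox this can be done with pveceph, one node at a time. A sketch for a single node (node name from the example above); wait for quorum to recover before touching the next MON:

```shell
# Destroy and recreate the monitor on this node; the new MON
# binds to an address in the new public_network
pveceph mon destroy pve2
pveceph mon create pve2

# Confirm the MON rejoined quorum before moving to the next node
ceph -s
```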
Step 4: Recreate MGRs (One by One)
Recreate standby managers first, then the active manager last.
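A sketch of the per-node commands, again using pveceph (node name is an example):

```shell
# Recreate a standby manager; repeat per node, active MGR last
pveceph mgr destroy pve3
pveceph mgr create pve3

# The "mgr:" line should show the expected active/standby set
ceph -s
```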
Recovery Tip: MGR Fails to Start
If a manager fails to start:
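A minimal triage sequence, assuming the failure is local to that node's daemon (unit and node names are examples):

```shell
# Read the daemon's log to see why it exited
journalctl -u ceph-mgr@pve3 -n 50 --no-pager

# If the daemon state is inconsistent, destroy and recreate it
pveceph mgr destroy pve3
pveceph mgr create pve3
```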
Step 5: Recreate CephFS Metadata Servers (MDS)
MDS binds its address at creation time, so each MDS must be destroyed and recreated to move it onto the new network.
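A per-node sketch with pveceph; standby MDS daemons keep CephFS available while each one is swapped (node name is an example):

```shell
# Recreate the metadata server so it binds on the new network
pveceph mds destroy pve4
pveceph mds create pve4

# Verify the MDS came back as active or standby
ceph fs status
```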
Step 6: Remove the Old Public Network
Edit /etc/pve/ceph.conf and remove the old network:
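After this edit, only the fabric subnet remains, a sketch:

```ini
[global]
    public_network = 10.50.0.0/24
```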
Step 7: Recreate MONs, MGRs, and MDS (Again)
This ensures all control-plane daemons bind exclusively to the new network.

Step 8: Protect the Cluster Before Touching OSDs
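The specific protective flags used here are an assumption based on standard Ceph maintenance practice:

```shell
# Prevent OSDs from being marked out and data from rebalancing
# while daemons restart
ceph osd set noout
ceph osd set norebalance
```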
Step 9: Restart OSDs (Data Plane Migration)
Restart one OSD at a time and wait for all placement groups to return to active+clean before proceeding:
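A sketch for a single OSD (id 0 here; repeat per OSD id):

```shell
# Restart one OSD, then wait for PGs to settle
systemctl restart ceph-osd@0

ceph -s        # repeat until all PGs report active+clean
ceph pg stat   # quick one-line PG summary
```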
Step 10: Remove Protection
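Assuming the noout/norebalance flags from the protection step, clear them once all OSDs are migrated and PGs are active+clean:

```shell
ceph osd unset norebalance
ceph osd unset noout
```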
Verification (Critical)
1. Verify Ceph Daemon Addresses
- ✅ front_addr → 10.50.0.x
- ✅ back_addr → 10.50.0.x
- ❌ No 172.16.x.x addresses anywhere
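A sketch of commands to check the bound addresses (repeat the metadata query per OSD id):

```shell
# MON addresses
ceph mon dump | grep addr

# OSD front/back addresses for OSD id 0
ceph osd metadata 0 | grep -E 'front_addr|back_addr'

# Nothing should still reference the old subnet
ceph osd dump | grep '172\.16\.' || echo "no old-subnet addresses"
```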
2. Verify Traffic is Using the Ceph Fabric
While Ceph is under load, watch the fabric interface on each node and confirm that Ceph traffic flows over 10.50.0.0/24 rather than the management network.

3. Verify Network Performance (iperf3)
- ~2.1–2.4 Gbit/s
- Minimal or zero retransmits
- Stable throughput across multiple streams
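A sketch of an iperf3 run between two nodes, bound to their fabric addresses (addresses from the example above):

```shell
# On one node: server bound to its fabric address
iperf3 -s -B 10.50.0.20

# On another node: client with 4 parallel streams for 30 seconds
iperf3 -c 10.50.0.20 -B 10.50.0.30 -P 4 -t 30
```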
Troubleshooting
"OSDs Not Reachable / Wrong Subnet"

Symptom: after a restart, OSDs report addresses in the old subnet or fail heartbeat checks, typically because /etc/pve/ceph.conf still lists the old range or the daemon has not been restarted since the change.

Risks Considered
Why This Change is Risky

Changing the public network touches the bind address of every daemon, including the MONs that hold the cluster's control plane; a mistake here can cost MON quorum and take the entire cluster offline.
Failure Modes Considered
- MON quorum loss
- OSD flapping
- Client IO stalls
- Backfill storms
- Split-brain conditions
Assumptions
- Single Ceph cluster
- Dedicated replication network (fabric)
- Change executed during a low-I/O window
Final State
- Dedicated Fabric: 2.5 GbE Ceph-only network
- Clean Separation: no Ceph traffic on the management NIC
- Full Migration: MON / MGR / MDS / OSD all migrated
- No Data Loss: stable cluster, zero downtime
Acknowledgements
This migration approach was heavily informed by the Proxmox forum discussion on Ceph network changes, which provided critical guidance on:

- Temporarily running dual public networks
- Recreating MON, MGR, and MDS daemons to force address rebinding
- Avoiding full cluster downtime during network migration