Each decision below records:
- What was decided
- What alternatives were considered
- Why the choice was made
Hardware Decisions
NAS Motherboard — ASUS TUF Gaming Z690-Plus WiFi D4
Decision: Replace the original Gigabyte B760M DS3H DDR4 with the ASUS TUF Gaming Z690-Plus WiFi D4.

Alternatives Considered:
- Gigabyte B760M DS3H DDR4 (original plan — already owned)
- MSI PRO B660M-A DDR4 (found in IT closet, spare)
- ASUS TUF Gaming Z690-Plus WiFi D4 (purchased used ~$100 on Reddit r/hardwareswap)
NAS 10GbE NIC — Dell Intel X710-DA2
Decision: Dell Intel X710-DA2 dual-port SFP+ NIC (~$25 on eBay).

Alternatives Considered:
- Intel X520 (older, very common in homelabs)
- Dell Intel X710-DA2 (newer chipset, dual port)
NAS CPU — i5-13400 over i5-13600
Decision: Keep the i5-13400 for the NAS, sell the i5-13600.

Rationale: Unraid is IO-bound, not compute-bound. The performance delta between the two chips is irrelevant for a NAS workload, and selling the i5-13600 partially offsets the MS-A2 cost. The i5-13400 also has an Intel UHD 730 iGPU for Plex QuickSync hardware transcoding — the only compute-heavy task the NAS does.

NAS RAM — 32GB, sold extra sticks
Decision: Keep 32GB DDR4 for the NAS, sell the extra 2x 16GB sticks.

Rationale: No heavy VM workloads run on Unraid — it’s a pure NAS with one Plex Docker container. 32GB is generous for this role. Selling the extra sticks recoups cost.

Primary Compute — Minisforum MS-A2
Decision: Minisforum MS-A2 (Ryzen 9 7945HX, 64GB DDR5) as primary Proxmox node.

Rationale: The 16-core/32-thread CPU with strong single-thread performance handles multiple concurrent VMs and LXCs without breaking a sweat. Dual 10GbE SFP+ built in eliminates the need for a separate NIC for the storage link. The compact form factor fits on a 2U shelf in the rack. 64GB DDR5 is sufficient for all planned workloads with headroom; the platform maxes out at 96GB.

MS-A2 Boot — ZFS RAID-1 Mirror
Decision: Two NVMe drives in a ZFS RAID-1 mirror configured in the Proxmox installer.

Rationale: A single NVMe boot drive was the primary fragility in v2 — one drive failure killed the entire hypervisor. A mirrored boot means a single drive can fail and Proxmox keeps running. Selected the Proxmox installer's ZFS RAID-1 option at install time; no additional configuration needed.

Secondary Compute — Dell Optiplex 3070 Micro
Decision: Use the existing Optiplex 3070 Micro as pve-prod-02 rather than building a second full Proxmox host from spare parts.

Alternatives Considered:
- Building a second full node using the spare i5-13600, spare motherboard, and extra 32GB RAM
- Using the Optiplex as-is (free from work)
UPS — First Hardware Purchased
Decision: Tripp Lite SMART1500LCDXL 1500VA/900W (~$145). First hardware purchased.

Rationale: A UPS was a non-negotiable requirement before powering any spinning-rust drives — power loss during a write on HDDs is a data corruption risk. The USB interface enables NUT integration for fully automated graceful shutdown. 1500VA handles the full stack (5 devices) with capacity to spare. Purchased before any other hardware as a hard rule.

Networking Decisions
VLAN Design — 4 VLANs
Decision: VLAN 10 (Management), VLAN 20 (Trusted), VLAN 30 (Services), VLAN 40 (IoT).

Rationale: Clean separation of concerns. The Management VLAN isolates infrastructure devices (Proxmox hosts, NAS, switches) from everything else. Trusted is personal devices. Services is all v3 VMs and LXCs. IoT is fully isolated with internet-only access. The default VLAN 1 is intentionally not used for management — devices should never accidentally land on the management VLAN.

No HA (High Availability)
Decision: Proxmox cluster without HA enabled.

Rationale: HA requires matched hardware and a 3+ node quorum to be meaningful. The Optiplex cannot absorb the MS-A2’s workload. Enabling HA without these conditions is false security — it creates the illusion of resilience without delivering it. Clustering is enabled for the unified management UI only. If pve-prod-01 goes down, services on it are down until it comes back — acceptable for a homelab.

QDevice on Pi
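For reference, registering a QDevice is a short, well-trodden procedure; a sketch below, with a placeholder address standing in for the Pi:

```shell
# Sketch of the standard Proxmox QDevice setup; 192.0.2.10 is a
# placeholder for pi-prod-01's address. Run only against a real cluster.

# On pi-prod-01 (the tiebreaker):
sudo apt install corosync-qnetd

# On each Proxmox node:
apt install corosync-qdevice

# From one Proxmox node, register the Pi as the quorum device:
pvecm qdevice setup 192.0.2.10
```

After setup, `pvecm status` should show an odd expected-votes count, with the QDevice contributing the tiebreaking vote.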
Decision: Run a Proxmox QDevice on pi-prod-01 as the cluster tiebreaker.

Rationale: A 2-node Proxmox cluster without a quorum device can suffer split-brain — both nodes think they’re the primary and fence each other. A QDevice on the Pi costs nothing (10-minute setup) and prevents this. The Pi is lightweight, always-on, and perfectly suited to the role.

Reverse Proxy — Traefik (replaces NPM)
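For context, Traefik's Docker label-based routing looks roughly like this; a hypothetical compose fragment in which the service, domain, entrypoint, and resolver names are illustrative:

```yaml
# Hypothetical service; Traefik discovers it via Docker labels --
# no per-service config file needed.
services:
  audiobookshelf:
    image: ghcr.io/advplyr/audiobookshelf:latest
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.abs.rule=Host(`abs.example.com`)"
      - "traefik.http.routers.abs.entrypoints=websecure"
      # Resolver assumed to be configured for a Cloudflare DNS-01
      # wildcard certificate.
      - "traefik.http.routers.abs.tls.certresolver=cloudflare"
      - "traefik.http.services.abs.loadbalancer.server.port=80"
```

Adding a new service then means adding labels to its own compose stack, rather than editing a central proxy config.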
Decision: Migrate from Nginx Proxy Manager to Traefik in v3.

Rationale: Traefik is the default ingress controller in k3s — learning it now in a Docker context pays dividends when Kubernetes is introduced in Phase 6. Docker label-based routing eliminates per-service config files. A wildcard cert via the Cloudflare DNS-01 challenge covers all internal services with one cert. NPM is kept running in parallel during migration until all services are confirmed on Traefik — rollback is an AdGuard DNS rewrite flip.

External Access — Cloudflare Tunnel + Plex Port Forward
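The tunnel side of this arrangement can be pictured with a cloudflared config sketch; hostnames, backend addresses, and the tunnel ID are placeholders:

```yaml
# /etc/cloudflared/config.yml -- hypothetical example
tunnel: <tunnel-id>
credentials-file: /etc/cloudflared/<tunnel-id>.json
ingress:
  - hostname: auth.example.com
    service: http://authentik:9000
  - hostname: requests.example.com
    service: http://seerr:5055
  # cloudflared requires a catch-all as the final rule:
  - service: http_status:404
```

Plex is deliberately absent here: it is reached through the direct 32400 port forward instead.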
Decision: Cloudflare Tunnel for all externally exposed services except Plex. Plex gets a direct port forward on 32400.

Rationale: Cloudflare’s ToS prohibits video streaming through the tunnel — so Plex cannot use it. Plex handles its own TLS and relay natively, so a direct port forward is simple and proven. Everything else (Authentik, ABS, Seerr, Shelfmark) goes through the tunnel with no port forwarding required. Pangolin was evaluated and rejected: a VPS relay adds latency for Plex streaming, Cloudflare Tunnel is free and works, and there is no meaningful reason to add VPS cost.

Storage Decisions
NAS Platform — Unraid 7.2.4 (replaces TrueNAS SCALE)
Decision: Unraid as the NAS OS.

Rationale: The hybrid ZFS + parity array model fits the data risk tolerance perfectly. ZFS is used where data integrity is non-negotiable (photos, backups). The parity array is used for recoverable bulk media, where mixed drive sizes and expandability matter more than ZFS guarantees. TrueNAS ran as a VM in v2 — in v3 the NAS OS gets the dedicated host it should have always had.

No Cache Pool at Launch
Decision: No NVMe cache pool installed at launch.

Rationale: Downloads must bypass cache entirely — downloads and media must live on the same filesystem for hardlinks and atomic moves to work. If downloads go through cache while media lives on the array, hardlinks break silently and cause file duplication. Container appdata lives on the docker-prod-01 local disk, not the NAS. Plex transcode temp points at a local Unraid directory. No real workload justifies cache at launch.

If a use case emerges: add 2x 512GB NVMe as a BTRFS RAID-1 mirrored cache pool. Cache drives must always be mirrored — a single cache SSD failing before the mover runs means data loss.

Downloads Share — Parity Array Direct, No Cache
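The hardlink constraint driving this share's configuration can be demonstrated in plain shell; the paths are hypothetical stand-ins for the Unraid shares:

```shell
# Both "shares" live on the same filesystem here, mirroring the
# downloads-and-media-on-one-array rule.
mkdir -p /tmp/array/downloads /tmp/array/media
echo data > /tmp/array/downloads/film.mkv

# A hardlink is a second name for the same inode: no copy, instant,
# and the torrent client keeps seeding while the library sees the file.
# ln fails with EXDEV if the two paths are on different filesystems,
# which is exactly what cache involvement would cause.
ln /tmp/array/downloads/film.mkv /tmp/array/media/film.mkv

stat -c '%h' /tmp/array/media/film.mkv   # link count: 2
```

Deleting either name leaves the data reachable through the other, which is why seeding and the media library can share one on-disk copy.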
Decision: Downloads share set to “Use cache: No” in Unraid. Writes go directly to the parity array.

Rationale: Hard requirement for hardlinks. Downloads and media must be on the same filesystem; cache involvement breaks this. Non-negotiable — violations cause silent file duplication and broken seeding.

Dual Parity on Array
Decision: 2x WD Red Pro 12TB as parity drives (dual parity).

Rationale: With 3+ data drives, dual parity is worth the cost. With single parity, a second drive failing during a rebuild loses the array; dual parity tolerates two simultaneous failures.

SkyHawk and Barracuda Drives — Not in Unraid Array
Decision: 4x Seagate SkyHawk 6TB go to the Synology for cold backup storage only. 1x Seagate Barracuda 4TB retired.

Rationale: The SkyHawk drives use surveillance firmware — not appropriate for a NAS parity array. The Barracuda is a desktop drive not rated for always-on NAS duty. Neither belongs in a production array.

UID/GID — 2000:2000 for All Services
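In compose terms, the dedicated service identity looks like this; a hypothetical fragment using the linuxserver.io PUID/PGID convention, with an illustrative service and appdata path:

```yaml
services:
  radarr-1080p:
    image: lscr.io/linuxserver/radarr:latest
    environment:
      - PUID=2000   # dedicated service user -- not root, not the human account
      - PGID=2000
      - TZ=Etc/UTC
    volumes:
      - /opt/appdata/radarr-1080p:/config   # hypothetical appdata path
```

The same PUID/PGID pair is repeated on every container, so files written to NFS shares carry one consistent owner.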
Decision: All containers run as PUID=2000, PGID=2000. Clean break from v2.

Rationale: v2 had messy ownership from organic growth — some services ran as root, some as gio (1000:1000), inconsistent across NFS mounts. v3 starts clean with a dedicated service UID/GID separate from both root and the human user account. NAS share permissions allow 2000:2000 read/write on all service shares.

Service & Compute Decisions
Plex — Unraid Native Docker Container
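On Unraid this is configured through the container template UI, but the essential pieces are the iGPU device mapping and a local media path; a compose-style sketch with illustrative image and paths:

```yaml
services:
  plex:
    image: lscr.io/linuxserver/plex:latest   # illustrative image choice
    devices:
      - /dev/dri:/dev/dri   # expose the UHD 730 iGPU for QuickSync
    volumes:
      - /mnt/user/media:/media   # media local to the host -- no NFS hop
```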
Decision: Plex runs directly on Unraid as a Docker container, not on docker-prod-01.

Rationale: The i5-13400’s QuickSync iGPU (Intel UHD 730) is mature, well supported, and handles multiple simultaneous 1080p transcodes easily. Running Plex on Unraid eliminates an NFS hop — media files are local to the host doing the transcoding. No iGPU passthrough complexity on the MS-A2 is needed. The MS-A2’s integrated Radeon graphics remain available for future use, but QuickSync is the better transcoding choice for this workload.

Dual Radarr Instances — 1080p and 4K
Decision: Two separate Radarr instances: radarr-1080p and radarr-4k.

Rationale: 4K WebDL for local viewing via Infuse on the Apple TV 4K; 1080p for shared users to avoid transcoding overhead. A single Radarr instance managing both quality tiers gets messy with quality profiles and root folders. Separate instances are clean and independently configurable.

Dual Sonarr Instances — TV and Anime
Decision: Two separate Sonarr instances: sonarr-tv and sonarr-anime.

Rationale: Anime requires different quality profiles, release-group logic, and indexer configuration than regular TV. A single instance handling both gets messy. Separate instances are independently configurable.

qBitrr Replaces Custom qbit-automation Sidecar
Decision: qBitrr replaces the custom Python sidecar + cron job used in v2.

Rationale: qBitrr manages all four ARR instances from a single installation with a proper web UI. The v2 custom sidecar was over-engineered and fragile. qBitrr handles seeding control, MAM compliance (14-day minimum seed time), and torrent health monitoring out of the box.

Authentik — Dedicated VM, Not LXC
Decision: Authentik runs on a dedicated VM (auth-prod-01), not in an LXC.

Rationale: The Proxmox community officially removed the Authentik LXC helper script in May 2025 due to frequent breakage and a 14GB RAM requirement during build. A VM is the only stable deployment option. A dedicated VM also isolates the IdP from other workloads — appropriate given that Authentik is the SSO gateway for everything.

Immich — Dedicated VM
Decision: Immich runs on a dedicated VM (immich-prod-01), not on docker-prod-01.

Rationale: The Immich ML worker causes significant CPU spikes during face detection and CLIP indexing. Isolating it on a dedicated VM allows resource caps (CPU limits, RAM limits) without affecting the media stack on docker-prod-01.

Single Docker Host VM
Decision: One docker-prod-01 VM for all media and application containers.

Rationale: Multi-VM Docker arrangements solve HA poorly and add operational complexity. All of these containers will eventually migrate to k3s anyway — over-investing in a multi-VM Docker architecture that gets torn down in Phase 6 is wasteful. One clean VM, one NFS mount, one place to look when something breaks.

Kubernetes — Sandbox First
Decision: k3s introduced as an isolated sandbox cluster in Phase 6. No production services migrate until the cluster is fully understood and stable.

Rationale: A broken k3s cluster on top of an unstable foundation helps nobody. The sandbox runs in parallel with zero impact on running services. Only after the sandbox phase is complete will selective production migration begin. The ARR stack, qBittorrent/Gluetun, and the books stack intentionally stay in Docker permanently — the hardlink and VPN-killswitch workflows do not translate cleanly to Kubernetes.

Cockpit on docker-prod-01
Decision: Install Cockpit alongside Docker on docker-prod-01.

Rationale: Provides web-based day-to-day management of the VM — disk usage, network, logs, updates, file browser — without needing to SSH for routine tasks. Low overhead, high operational value.

Backup Decisions
PBS for VM/LXC Snapshots
Decision: Proxmox Backup Server (PBS) runs as a VM (pbs-prod-01) on pve-prod-02, backing up all VMs on both nodes to the NAS ZFS mirror pool.

Rationale: PBS is the right tool for VM-level backups in a Proxmox environment. Running it on pve-prod-02 keeps the backup server on a different node than the primary workloads — a backup server on the same node it’s backing up is a single point of failure.

Appdata Backups — rsync, Separate from PBS
Decision: Docker appdata and compose stacks are backed up independently via a hardened rsync script, separate from PBS.

Rationale: PBS backs up VM disk images — it does not back up application data inside VMs. Container appdata (databases, configs) requires its own backup strategy: rsync to the NAS /backups share (ZFS mirror pool) nightly, with Healthchecks.io heartbeat monitoring for silent-failure detection.

Decision Summary Table
| Category | Decision | Key Rationale |
|---|---|---|
| NAS Platform | Unraid 7.2.4 | Hybrid ZFS+parity fits risk tolerance |
| NAS CPU | i5-13400 | IO-bound workload; QuickSync for Plex |
| Primary Compute | MS-A2 Ryzen 9 7945HX | 16C/32T + dual 10GbE built-in |
| Boot Redundancy | ZFS RAID-1 mirror (MS-A2) | Single NVMe was v2’s fragility |
| VLANs | 4 VLANs (Mgmt/Trusted/Services/IoT) | Clean separation of concerns |
| HA | Disabled | Mismatched hardware = false security |
| QDevice | Enabled on Pi | Prevents split-brain in 2-node cluster |
| Reverse Proxy | Traefik | k3s ingress alignment |
| Cache Pool | Not installed | Downloads bypass cache (hardlink rule) |
| Dual Parity | Enabled | 3+ data drives warrant dual parity |
| Plex Deployment | Unraid native Docker | QuickSync iGPU; no NFS hop |
| Authentik | Dedicated VM | LXC unstable; VM is stable |
| Immich | Dedicated VM | ML worker CPU spikes; isolation |
| k3s Approach | Sandbox first | No prod until proven stable |