Maintenance and Operations

This guide covers essential maintenance operations for running a Copr instance, including backups, monitoring, upgrades, and routine tasks.

Backup and Recovery

Backend Storage Backups

The backend uses RAID for redundancy and rsnapshot for incremental backups to storinator01.

Backup Schedule

Backups run via cron on the backend server:

crontab -l -u copr
# Typically runs weekly (Fridays)
0 3 * * 5 ionice --class=idle /usr/local/bin/rsnapshot_copr_backend >/dev/null

Verifying Backend Backups

Check the most recent backup start time:

ssh copr-be
xz -d < /var/log/cron-20241101.xz | grep '(copr) CMD'
# Look for:
# Nov  1 03:00:02 copr-be CROND[3482216]: (copr) CMD (ionice --class=idle /usr/local/bin/rsnapshot_copr_backend >/dev/null)

Find a build that completed just before that time (e.g., build 8185411) and verify that it exists on storinator01:

ssh [email protected]
find /srv/nfs/copr-be/copr-be-copr-user/backup/.sync/var/lib/copr/public_html/results/@copr/copr-pull-requests:pr:3473 | grep 8185411 | grep rpm$

Check available disk space:

df -h /srv/nfs/copr-be

Backups typically take several days to complete. While a backup is still in progress, a recent build may not have been copied yet, so wait for the run to finish before verifying.

Backend Recovery Procedure

Recovery from backups is a multi-day operation. Plan carefully and don’t rush.

The rsync from storinator runs at ~110 MB/s. For 20TB of data, expect 5 days of sync time.

Step 1: Prepare a new RAID array

Spawn a temporary instance:

git clone [email protected]:fedora-copr/ansible-fedora-copr.git
cd ansible-fedora-copr
./run-playbook pb-backup-recovery-01.yml

Run the configuration playbook:

ansible-playbook ./pb-backup-recovery-02.yml -i 54.81.xxx.xx, -u fedora

Step 2: Create RAID array

SSH to the instance and partition disks:

# Create a GPT label and a single partition on each disk, accepting fdisk defaults
for i in /dev/nvme[1-4]n1 ; do
    (echo gpt ; echo n ; echo ; echo ; echo ; echo ; echo w ) | sudo fdisk "$i"
done

Create RAID 10:

mdadm --create /dev/md0 --level raid10 \
    --name copr-backend-data --raid-disks 4 /dev/nvme[1-4]n1p1

Format and mount:

mkfs.ext4 /dev/md0 -L copr-repo
tune2fs -m0 /dev/md0
mkdir /mnt/data
mount /dev/disk/by-label/copr-repo /mnt/data/
# Set ownership after mounting so it applies to the new filesystem's root
chown copr:copr /mnt/data

Step 3: Work around a kernel bug

There’s a kernel bug causing IO operations to hang. Apply this workaround:

echo frozen > /sys/block/md0/md/sync_action

After data is copied (about a week), unfreeze:

echo idle > /sys/block/md0/md/sync_action

Step 4: Set up SSH keys

Run in tmux as the copr user:

tmux
su - copr
ssh-keygen -t rsa

Copy ~/.ssh/id_rsa.pub to storinator01:

ssh [email protected]
sudo su - copr
vim ~/.ssh/authorized_keys  # Add the public key

Step 5: Sync the data

From the temporary instance:

time until rsync -av -H --info=progress2 --rsh=ssh \
    --max-alloc=4G \
    [email protected]:/srv/nfs/copr-be/copr-be-copr-user/backup/.sync/var/lib/copr/public_html/ \
    /mnt/data; \
    do true; done

This command will retry on failure and run for approximately 5 days.
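As a sanity check on the time estimate, the raw transfer time can be computed from the data size and throughput. This is a minimal sketch using integer arithmetic; `estimate_transfer_hours` is a hypothetical helper, and the units (decimal TB and MB/s) are an assumption.

```shell
# Hypothetical helper: lower-bound transfer time in whole hours.
# $1 = data size in TB (10^12 bytes), $2 = sustained throughput in MB/s (10^6 bytes/s).
estimate_transfer_hours() {
    # TB -> MB is a factor of 10^6; seconds -> hours is a factor of 3600.
    echo $(( $1 * 1000000 / $2 / 3600 ))
}

estimate_transfer_hours 20 110   # prints 50
```

At a sustained 110 MB/s, 20 TB would take roughly 50 hours, so the multi-day figures in this guide imply a lower effective rate. The likely causes (many small files, hardlink handling, retries) are an assumption, not something measured here.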

Step 6: Attach volumes to production

Unmount from the temporary instance:

umount /mnt/data/
mdadm --stop /dev/md0

In AWS EC2 console:

Detach all copr-backend-backup-test-raid-10 volumes from temporary instance
Stop the backend service: systemctl stop copr-backend.target
Detach old volumes from production instance
Attach recovery volumes to production instance
Assemble RAID and mount

Step 7: Fix permissions

Temporarily disable SELinux:

setenforce 0

Start services:

systemctl start lighttpd.service copr-backend.target

Relabel filesystem:

time copr-selinux-relabel
setenforce 1

Database Backups

Private (Complete) Backups

Complete dumps with sensitive data are stored in /backups/ on the frontend:

ssh copr-fe
su - postgres
/usr/local/bin/backup-database coprdb

This script sleeps initially and takes 20+ minutes due to XZ compression. The backup contains sensitive data like API tokens - never download or publish it.

Backups are automatically pulled by rdiff-backup configured via Ansible: https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/rdiff-backup.yml

Verify backups exist:

ls -alh /backups/
# Should show recent timestamp:
# -rw-r--r--. 1 postgres postgres 662M Nov  5 01:21 coprdb-2024-11-05.dump.xz
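The freshness check can be automated. This is a minimal sketch, assuming dump files follow the coprdb-*.dump.xz naming shown in the listing and that GNU find is available; `recent_dump_exists` and the 48-hour default are hypothetical choices, not part of Copr.

```shell
# Hypothetical helper: succeed if directory $1 holds at least one coprdb dump
# modified within the last $2 hours (default 48, an arbitrary threshold).
recent_dump_exists() {
    max_age_min=$(( ${2:-48} * 60 ))
    # -mmin -N matches files modified less than N minutes ago.
    [ -n "$(find "$1" -maxdepth 1 -name 'coprdb-*.dump.xz' \
            -mmin "-$max_age_min" 2>/dev/null)" ]
}
```

For example: recent_dump_exists /backups || echo "no recent coprdb dump found"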

Public Database Dumps

Sanitized dumps (without private tables) are available at: https://copr.fedorainfracloud.org/db_dumps/

Generated by:

cat /etc/cron.d/cron-backup-database-coprdb

These dumps are suitable for:

Testing and debugging
Development environments
Public experimentation

Keygen and DistGit Backups

Keygen Volume Snapshots

GPG keypairs on /var/lib/copr-keygen are protected by EC2 volume snapshots. Verify in AWS Console:

Go to EC2 > Volumes > vol-0108e05e229bf7eaf
Check snapshots are being created in Ohio (us-east-2)
Filter with tag: FedoraGroup=copr

Snapshots are stored in us-east-2 (Ohio), not us-east-1 (Virginia).

DistGit Snapshots

DistGit data is extensive (terabytes) but not critical:

Periodic EC2 volume snapshots are taken
In case of failure, restore from snapshot or initialize empty volume
No formal backup process due to data being reproducible

System Upgrades

Upgrading Persistent Instances

Upgrading Copr infrastructure to new Fedora versions involves creating fresh VMs and migrating data.

Pre-Upgrade Preparation

1. Announce the outage (see Announcing Outages)

2. Check for hotfixes

On the old instance:

rpm -Va | grep -v -e /etc/ -e /boot/
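In `rpm -Va` output, a `5` in the third position of the flag column means the file digest differs from the packaged one. The output can be narrowed to those files with a small filter. This is a minimal sketch; `changed_content_files` is a hypothetical helper, and it assumes paths contain no spaces.

```shell
# Hypothetical helper: from `rpm -Va` output on stdin, print the paths whose
# nine-character flag column has "5" in the third position (digest mismatch).
# Lines such as "missing   /path" are skipped because their first field is shorter.
changed_content_files() {
    awk 'length($1) >= 9 && substr($1, 3, 1) == "5" { print $NF }'
}
```

For example: rpm -Va | grep -v -e /etc/ -e /boot/ | changed_content_files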

Review files marked with S.5....T. (size, digest, and modification time differ) - these have been modified. Also check: https://github.com/fedora-copr/copr/issues?q=label%3Ahot-fixed+is%3Aclosed

3. Clone helper repository

git clone [email protected]:fedora-copr/ansible-fedora-copr.git
cd ansible-fedora-copr

Review and update group_vars/{dev,prod}.yml with:

Correct data volume IDs from EC2
New AMI ID for Fedora N+2 from https://fedoraproject.org/cloud/download/
Instance types, names, IP addresses

4. Back up Let’s Encrypt certificates

sudo rbac-playbook -l copr-keygen.aws.fedoraproject.org \
    groups/copr-keygen.yml -t certbot

Do this for all instances (frontend, backend, distgit, keygen).

Launch New Instances

Spawn new VM:

opts=( -e copr_instance=dev -e server_id=keygen )
ansible-playbook play-vm-migration-01-new-box.yml "${opts[@]}"

Note the output:

ElasticIP: not specified
Instance ID: i-04ba36eb360187572
Network ID: eni-048189f432f068270
Private IP: 172.30.2.94

Update group_vars/{dev,prod}.yml with new instance and network IDs.
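To avoid copy-paste mistakes when transferring these values, the summary can be parsed. This is a minimal sketch, assuming the "Name: value" layout shown above and that the playbook output was saved to a file; `summary_field` is a hypothetical helper.

```shell
# Hypothetical helper: read the playbook summary on stdin and print the value
# of the field named in $1 (for example "Instance ID").
summary_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }'
}
```

For example, with the summary saved to a file of your choosing: summary_field 'Network ID' < summary.txt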

Backend Pre-Preparation

For backend only: Run the playbook against a temporary hostname before the outage to minimize downtime.

Ensure copr-be-dev-temp.aws.fedoraproject.org is in inventory:

[copr_back_dev_aws]
copr-be-dev.aws.fedoraproject.org
copr-be-dev-temp.aws.fedoraproject.org birthday=yes

Run playbook:

sudo rbac-playbook -l copr-be-dev-temp.aws.fedoraproject.org \
    groups/copr-backend.yml

Outage Window

1. Announce ongoing outage

2. Migrate IPs and volumes

For backend:

ansible-playbook play-vm-migration-02-migrate-backend-box.yml "${opts[@]}"

Follow manual instructions during playbook execution for DB backups and consistency checks.

For other services:

ansible-playbook play-vm-migration-02-migrate-non-backend-box.yml "${opts[@]}"

3. Provision new instances

In fedora-infra/ansible, set birthday=yes:

[copr_front_dev_aws]
copr.stg.fedoraproject.org birthday=yes

Run upgrade playbook:

sudo rbac-playbook -l copr-fe-dev.aws.fedoraproject.org \
    manual/copr/copr-frontend-upgrade.yml
sudo rbac-playbook -l copr-fe-dev.aws.fedoraproject.org \
    groups/copr-frontend.yml

4. Upgrade PostgreSQL (Frontend only)

Stop httpd:

systemctl stop httpd

Upgrade database:

dnf install postgresql-upgrade
postgresql-setup --upgrade
systemctl start postgresql

Rebuild indexes:

su postgres
reindexdb --all

Restart httpd:

systemctl start httpd

5. Apply hotfixes and finalize

Revert birthday=yes and set services_disabled: false. Rerun playbooks until all services are operational.

Post-Upgrade

1. Test reboot

reboot

Debug any boot issues now rather than during a future emergency.

2. Rename instances

Remove the -new suffix from new instances and add -old to old ones:

opts=( -e copr_instance=dev )
ansible-playbook play-vm-migration-03-rename-instances.yml "${opts[@]}"

3. Terminate old instances

In AWS EC2:

Disable termination protection: Actions → Instance settings → Change termination protection
Terminate instances

Keep old VMs for a few days if you want to retain the database dumps in /backups.

4. Announce resolution

Upgrading Builders

Builder VMs are ephemeral and automatically use the latest packages from infra repos.

If copr-rpmbuild is updated, terminate resalloc VMs to force recreation with new version.

Monitoring

Monitoring Services

Copr uses multiple monitoring systems:

Nagios - Primary monitoring for Fedora Infrastructure
- https://nagios.fedoraproject.org/nagios/cgi-bin//status.cgi?hostgroup=copr_all_instances_aws
- Checks: availability, storage, hypervisor health
Nagios External - External availability checks
- https://nagios-external.fedoraproject.org/nagios/cgi-bin//status.cgi?hostgroup=copr_all_instances_aws
Prometheus - Metrics and Grafana dashboards (internal to Red Hat)
UptimeRobot - Geographic CDN availability (AWS CloudFront)

Health Checks

copr-ping Test

Periodic end-to-end test that submits a build through the entire stack:

# Configured on backend
cat /etc/cron.d/copr-ping

Monitor results: https://copr.fedorainfracloud.org/coprs/g/copr/copr-ping/builds/

Storage Analysis

Weekly storage analysis generates usage statistics:

/usr/bin/copr-backend-analyze-results

View statistics: https://copr-be.cloud.fedoraproject.org/stats/index.html

Manual Health Checks

Verify all services:

# Check Copr package versions
./releng/run-on-all-infra 'rpm -qa | grep copr'

# List the Copr repositories configured on each host
./releng/run-on-all-infra 'dnf copr list'

# Check service status
systemctl status copr-backend.target
systemctl status httpd
systemctl status postgresql

Log Locations

# Frontend
/var/log/httpd/error_log
/var/log/copr-frontend/

# Backend  
/var/log/copr-backend/
/var/log/lighttpd/

# Database
/var/lib/pgsql/data/log/

# DistGit
/var/log/copr-dist-git/

Routine Maintenance Tasks

Managing Chroots

Enable New Fedora Release

Run this BEFORE Fedora branching happens to copy builds with correct dist tags.

ssh copr-fe
su - copr-fe
copr-frontend branch-fedora 31

This creates fedora-31-* chroots and forks latest successful Rawhide builds. Once actions are processed (check https://copr.fedorainfracloud.org/status/stats/), activate:

copr-frontend alter-chroot --action activate \
    fedora-31-x86_64 fedora-31-i386 \
    fedora-31-ppc64le fedora-31-aarch64 \
    fedora-31-armhfp fedora-31-s390x

Disable EOL Chroots

Check that other services (Fedora Review Service) don’t depend on the chroot before disabling.

fv=34
copr-frontend alter-chroot --action eol \
    fedora-$fv-x86_64 fedora-$fv-i386 \
    fedora-$fv-ppc64le fedora-$fv-aarch64 \
    fedora-$fv-armhfp fedora-$fv-s390x

This disables builds but preserves all repositories and data.
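The chroot lists in the activate and eol commands follow one naming pattern, so they can be generated rather than typed. This is a minimal sketch; `fedora_chroots` is a hypothetical helper, and the architecture list is whatever you pass in.

```shell
# Hypothetical helper: print space-separated chroot names for Fedora version $1,
# one per remaining argument (architecture).
fedora_chroots() {
    version=$1
    shift
    names=
    for arch in "$@"; do
        names="${names:+$names }fedora-$version-$arch"
    done
    printf '%s\n' "$names"
}
```

For example: copr-frontend alter-chroot --action eol $(fedora_chroots 34 x86_64 i386 ppc64le aarch64 armhfp s390x)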

EOL Lifeless Rolling Chroots

Automatically mark inactive rolling chroots (Rawhide, CentOS Stream):

copr-frontend eol-lifeless-rolling-chroots

Add to cron for daily execution:

# /etc/cron.d/copr-frontend-optional
0 2 * * * copr-fe /usr/bin/copr-frontend eol-lifeless-rolling-chroots

Database Maintenance

Manual Backup

su - postgres
/usr/local/bin/backup-database coprdb

Vacuum and Analyze

su - postgres
vacuumdb --all --analyze

Check Database Size

su - postgres
psql -c "SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) AS size FROM pg_database;"

Announcing Outages

Follow this workflow for planned maintenance:

1. Schedule outage

Create ticket: https://pagure.io/fedora-infrastructure/new_issue

2. Announce planned outage

Update status.fedoraproject.org
Email: [email protected]
Twitter/Mastodon via Fedora Infrastructure

3. Start maintenance - announce ongoing

Update status page to “Ongoing outage”

4. Complete maintenance - announce resolution

Update status page to “Resolved”
Email copr-devel with changes summary
Close infrastructure ticket

Emergency Procedures

Backend Down - Builds Failing

Check backend services:

systemctl status copr-backend.target
systemctl status lighttpd

Check RAID status:

cat /proc/mdstat
mdadm --detail /dev/md0
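A degraded array shows an underscore in the per-device status string of /proc/mdstat (for example [UU_U] instead of [UUUU]). This is a minimal sketch of a scripted check; `mdstat_degraded` is a hypothetical helper.

```shell
# Hypothetical helper: read /proc/mdstat-style text on stdin and succeed
# (exit 0) if any array reports a missing member, shown as "_" in the
# per-device status string such as [UU_U].
mdstat_degraded() {
    grep -Eq '\[[U_]*_[U_]*\]'
}
```

For example: mdstat_degraded < /proc/mdstat && echo "RAID array is degraded"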

Check disk space:

df -h /var/lib/copr/public_html/results
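The disk-space check can be turned into a threshold test. This is a minimal sketch that parses POSIX-format df output; `disk_usage_at_least` is a hypothetical helper, and any threshold you pass is your own choice rather than a documented Copr limit.

```shell
# Hypothetical helper: read `df -P` output for one filesystem on stdin and
# succeed if its use percentage is at or above the threshold in $1.
# Exits with status 2 if no data line is present.
disk_usage_at_least() {
    awk -v limit="$1" '
        NR == 2 {
            sub(/%/, "", $5)                    # "90%" -> "90"
            found = 1
            rc = ($5 + 0 >= limit + 0) ? 0 : 1
            exit
        }
        END { exit (found ? rc : 2) }
    '
}
```

For example: df -P /var/lib/copr/public_html/results | disk_usage_at_least 90 && echo "results volume is nearly full"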

Review logs:

journalctl -u copr-backend -n 100
tail -100 /var/log/copr-backend/backend.log

Frontend Down - Website Inaccessible

Check httpd:

systemctl status httpd
journalctl -u httpd -n 50

Check database:

systemctl status postgresql
su - postgres -c "psql -c 'SELECT 1'"

Check disk space:

df -h

Database Issues

Check connections:

su - postgres
psql -c "SELECT count(*) FROM pg_stat_activity;"

Check for long queries:

psql -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;"

Check locks:

psql -c "SELECT * FROM pg_locks WHERE NOT granted;"

Additional Resources

Deployment Options

Learn how to deploy Copr in different environments

Release Process

Understand the Copr release workflow

Fedora Infra Copr SOP

Official Fedora Infrastructure procedures

Architecture

Understand Copr’s system architecture
