Troubleshooting

This guide covers common issues you might encounter with Uncloud and how to resolve them.

Network connectivity problems

Machines can’t connect to each other

Symptoms: Containers on different machines can’t communicate, WireGuard tunnels are not established

Diagnosis:

# Check machine status
uc machine ls

# SSH into a machine and check WireGuard
ssh USER@HOST
sudo wg show

Solutions:

Check firewall rules

# Ensure UDP port 51820 is open
sudo ufw status
sudo ufw allow 51820/udp

Verify WireGuard endpoints
```
uc machine ls
```
Check that the WIREGUARD ENDPOINTS column shows reachable addresses

Update machine endpoints

If a machine’s IP changed:

uc machine update machine1 --endpoint NEW_IP:51820

Check WireGuard interface
```
ssh USER@HOST
ip addr show wg0
```
The interface should have an IP like 10.210.X.1/24

Restart Uncloud daemon
```
sudo systemctl restart uncloud
```

Understanding WireGuard connectivity

Uncloud creates a WireGuard mesh where each machine connects to every other machine. If Machine A can’t reach Machine B:

Machine A needs to know Machine B’s endpoint (IP:port)
Machine B’s firewall must allow UDP port 51820
NAT routers must allow UDP hole punching (most do)

Check connectivity:

# From Machine A, ping Machine B's WireGuard IP
ping 10.210.1.1

If pings fail, check:

Firewall rules on both machines
NAT configuration
WireGuard logs: sudo journalctl -u uncloud -f
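
The checks above can be narrowed down by looking at when each peer last completed a handshake. The following is a minimal sketch, not part of Uncloud: `stale_peers` is a hypothetical helper that reads the output of `wg show <interface> latest-handshakes` and lists peers that have never completed a handshake or whose last handshake is older than a threshold. The interface name `wg0` follows the example earlier in this guide, and the 180-second default is an assumption; an idle peer can legitimately show an older handshake.

```shell
#!/bin/sh
# stale_peers: list WireGuard peers with a missing or old handshake.
# stdin: output of `wg show <interface> latest-handshakes`
#        (one "<public-key> <unix-timestamp>" pair per line, 0 = never)
# $1:    current Unix time, e.g. "$(date +%s)"
# $2:    maximum acceptable age in seconds (assumed default: 180)
stale_peers() {
  awk -v now="$1" -v max="${2:-180}" '
    NF == 2 && $2 == 0        { print $1, "never"; next }
    NF == 2 && now - $2 > max { print $1, (now - $2) "s ago" }
  '
}

# Usage on a machine (interface name is an assumption):
#   sudo wg show wg0 latest-handshakes | stale_peers "$(date +%s)"
```

A peer reported as `never` usually points at an endpoint or firewall problem rather than at the container network.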

Containers can’t resolve service names

Symptoms: DNS resolution fails inside containers, curl http://service.internal doesn’t work

Diagnosis:

# Check DNS configuration
uc service exec myservice cat /etc/resolv.conf

# Test DNS resolution
uc service exec myservice nslookup web-api.internal

Solutions:

Verify the service exists
```
uc service ls
```
Check internal DNS server

The container’s /etc/resolv.conf should list the machine’s WireGuard IP:
```
nameserver 10.210.X.1
```

Restart the container

uc service scale myservice 0
uc service scale myservice 1

Check Uncloud daemon logs

ssh USER@HOST
sudo journalctl -u uncloud | grep -i dns
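
The resolv.conf check from the steps above can be scripted. This is a small sketch under one assumption: that Uncloud DNS servers live in the 10.210.0.0/16 range, which is inferred from the 10.210.X.1 examples in this guide and may differ in your cluster. `uncloud_nameservers` is a hypothetical helper name.

```shell
#!/bin/sh
# uncloud_nameservers: print nameserver entries that fall in 10.210.0.0/16.
# stdin: the contents of a container's /etc/resolv.conf
# The 10.210.0.0/16 range is an assumption based on the examples in this guide.
uncloud_nameservers() {
  awk '$1 == "nameserver" && $2 ~ /^10\.210\.[0-9]+\.[0-9]+$/ { print $2 }'
}

# Usage:
#   uc service exec myservice cat /etc/resolv.conf | uncloud_nameservers
```

Empty output suggests the container is not using the machine’s internal DNS server, which would explain failing lookups for .internal names.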

Containers can’t reach the internet

Symptoms: curl https://google.com fails inside containers

Diagnosis:

# Test from inside a container
uc service exec myservice ping -c 3 8.8.8.8
uc service exec myservice curl https://google.com

Solutions:

Check NAT/masquerading
```
ssh USER@HOST
sudo iptables -t nat -L POSTROUTING -v -n
```
You should see a MASQUERADE rule for the Uncloud network

Verify Docker network
```
docker network inspect uncloud
```
Check that EnableIPMasquerade is true

Check DNS forwarding
```
uc service exec myservice cat /etc/resolv.conf
```
If only the Uncloud DNS server is listed, it should forward external queries
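
The two diagnosis commands in this section help separate a routing problem from a name-resolution problem. The sketch below encodes that reasoning in a hypothetical helper, `classify_egress`. It assumes that `uc service exec` passes through the exit status of the command it runs, which this guide does not confirm, and that ICMP is not blocked on your network.

```shell
#!/bin/sh
# classify_egress: interpret the results of the two diagnosis commands.
# $1: exit status of the ping to 8.8.8.8 run inside the container
# $2: exit status of the curl to https://google.com run inside the container
classify_egress() {
  if [ "$1" -ne 0 ]; then
    # No IP connectivity at all: look at NAT/masquerading and the Docker network.
    echo "no route"
  elif [ "$2" -ne 0 ]; then
    # IP works but the HTTPS request by name fails: look at DNS forwarding first.
    echo "dns or https"
  else
    echo "ok"
  fi
}

# Usage (assumes exit statuses are propagated by `uc service exec`):
#   uc service exec myservice ping -c 3 8.8.8.8; ip_status=$?
#   uc service exec myservice curl -sS -o /dev/null https://google.com; name_status=$?
#   classify_egress "$ip_status" "$name_status"
```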

Service deployment failures

Deployment hangs or times out

Symptoms: uc deploy or uc run never completes

Diagnosis:

# Check service status
uc service ls

# Check container state
uc service inspect myservice

# View logs
uc service logs myservice

Solutions:

Image pull failures

If the image is private or large:

# Check Docker logs
ssh USER@HOST
sudo journalctl -u docker -f

Solution: Use Unregistry for faster local image distribution:

uc build -t myapp .
uc push myapp
uc run myapp

Container crashes on startup
```
uc service logs myservice
```
Look for error messages in the logs

Resource constraints
```
ssh USER@HOST
free -h
df -h
```
Check if the machine has enough memory or disk space

Machine unreachable

uc machine ls

If a machine shows “Down”, try:

ssh USER@HOST sudo systemctl restart uncloud

Port already in use

Symptoms: Error like “bind: address already in use”

Diagnosis:

ssh USER@HOST
sudo netstat -tulpn | grep :80
sudo netstat -tulpn | grep :443

Solutions:

Stop conflicting services

# If nginx is running
sudo systemctl stop nginx
sudo systemctl disable nginx

Use different ports

Instead of binding host port 80, publish the service on a different host port:
```
uc run -p 8080:80 myapp
```
Remove old containers
```
docker ps -a
docker rm -f CONTAINER_ID
```
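
Note that `grep :80` also matches ports such as 8080 or 8000. The sketch below is one way to match the port exactly and extract the owning process ID from `netstat -tulpn` output; `listening_pids` is a hypothetical helper, not an Uncloud command.

```shell
#!/bin/sh
# listening_pids: print the PIDs of processes listening on exactly the given port.
# stdin: output of `sudo netstat -tulpn`
# $1:    port number
listening_pids() {
  awk -v port="$1" '
    {
      # Column 4 is the local address; the port is its last colon-separated part.
      n = split($4, addr, ":")
      if (n < 2 || addr[n] != port) next
      # The PID/Program column is the first field from column 6 on that looks like "1234/name".
      for (i = 6; i <= NF; i++)
        if ($i ~ /^[0-9]+\//) { split($i, p, "/"); print p[1]; break }
    }
  ' | sort -un
}

# Usage:
#   sudo netstat -tulpn | listening_pids 80
```

The PIDs it prints can then be inspected with `ps -p PID` before deciding whether to stop the service.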

Replicas not spreading across machines

Symptoms: All replicas are scheduled on one machine

Diagnosis:

uc service inspect myservice

Solutions:

Check machine availability
```
uc machine ls
```
Ensure machines are in “Up” state

Use placement constraints

services:
  web:
    image: myapp
    deploy:
      mode: replicated
      replicas: 3
      placement:
        machines:
          - machine1
          - machine2
          - machine3

Check machine resources

Uncloud’s scheduler prefers machines with more available resources

Certificate issues

Let’s Encrypt certificate not obtained

Symptoms: HTTPS doesn’t work, browser shows “Not Secure” warning

Diagnosis:

# Check Caddy logs
uc service logs caddy | grep -i acme

# View Caddy config
uc caddy config

Solutions:

Verify DNS resolution
```
dig app.example.com
```
DNS must point to your machine’s public IP

Check port 80 accessibility

Let’s Encrypt uses the HTTP-01 challenge on port 80:
```
curl http://app.example.com/.well-known/acme-challenge/test
```

Check firewall rules

ssh USER@HOST
sudo ufw status
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp

Wait for DNS propagation

DNS changes can take up to 48 hours. Check with:
```
dig app.example.com @8.8.8.8
```
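
To compare what a resolver returns with the address you expect, the `dig` output can be piped through a small check. This is a sketch: `dns_status` is a hypothetical helper, and 203.0.113.10 is the example public IP used elsewhere in this guide.

```shell
#!/bin/sh
# dns_status: compare resolver answers with the expected public IP.
# stdin: output of `dig +short <name> @<resolver>` (one record per line)
# $1:    the IP address the name should resolve to
dns_status() {
  awk -v want="$1" '
    $0 == want { hit = 1 }
    END {
      if (hit)          print "match"
      else if (NR == 0) print "no answer"
      else              print "mismatch"
    }
  '
}

# Usage:
#   dig +short app.example.com @8.8.8.8 | dns_status 203.0.113.10
```

A `mismatch` from a public resolver while your authoritative DNS is already correct usually means the old record is still cached.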

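Check rate limits

Let’s Encrypt enforces rate limits. If you exceed them, point Caddy at the staging environment in a separate Caddyfile (Caddyfile.staging in this example):

{
  acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
}

Then deploy it:

uc caddy deploy --caddyfile Caddyfile.staging

Staging certificates are not trusted by browsers, so use them only to confirm that issuance works.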

Certificate expired

Symptoms: Browser shows “Your connection is not private” error

Diagnosis:

# Check certificate expiry
uc service exec caddy ls -la /data/caddy/certificates/

Solutions:

Force renewal

# Restart Caddy to trigger renewal
uc service scale caddy 0
uc service scale caddy 1

Check Caddy logs
```
uc service logs caddy | grep -i renew
```

Delete old certificate

uc service exec caddy rm -rf /data/caddy/certificates/acme-v02.api.letsencrypt.org-directory/app.example.com

Then restart Caddy

Cluster state issues

Machine shows as “Down” but it’s running

Symptoms: uc machine ls shows a machine as Down, but you can SSH into it

Diagnosis:

# Check daemon status on the machine
ssh USER@HOST
sudo systemctl status uncloud
sudo systemctl status uncloud-corrosion

Solutions:

Restart services

sudo systemctl restart uncloud
sudo systemctl restart uncloud-corrosion

Check Corrosion state
```
sudo journalctl -u uncloud-corrosion -f
```
Look for replication errors

Verify cluster connectivity
```
sudo wg show
```
Check that WireGuard peers are connected

Services not showing up after deployment

Symptoms: uc service ls doesn’t show a newly deployed service

Diagnosis:

# Check deployment status
uc service ls
uc service inspect SERVICE_NAME

Solutions:

Wait for state propagation

Cluster state is eventually consistent, so a new service may take a moment to appear on every machine. Wait 10-30 seconds and retry:
```
uc service ls
```

Check daemon logs

ssh USER@HOST
sudo journalctl -u uncloud | tail -50

Verify Corrosion is running

sudo systemctl status uncloud-corrosion
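
Because state propagation takes a few seconds, scripts that deploy and then immediately query the cluster may need to retry. The following is a minimal sketch of a generic retry loop. `retry` is a hypothetical helper, and grepping `uc service ls` output for the service name is an assumption about that command’s output.

```shell
#!/bin/sh
# retry: run a command until it succeeds.
# $1:  maximum number of attempts
# $2:  seconds to sleep between attempts
# $3+: the command to run
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while ! "$@"; do
    # Give up once the attempt budget is used.
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage: poll for up to 30 seconds (6 attempts, 5 seconds apart).
#   retry 6 5 sh -c 'uc service ls | grep -q myservice'
```

If the service still does not appear after the retries, continue with the daemon and Corrosion checks above.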

Debug commands

Useful commands for debugging issues:

Network debugging

# Ping another machine's WireGuard IP
ping 10.210.1.1

# Test DNS resolution
nslookup service.internal
dig service.internal

# Check routing table
ip route show

# Check WireGuard status
sudo wg show

# Test container connectivity
uc service exec myservice ping 10.210.1.5
uc service exec myservice curl http://other-service.internal:8000

Service debugging

# View service details
uc service inspect myservice

# Check container logs
uc service logs myservice
uc service logs -f myservice  # Follow
uc service logs --since 1h myservice  # Last hour

# Execute commands in container
uc service exec myservice ps aux
uc service exec myservice env
uc service exec myservice cat /etc/resolv.conf

# Check port bindings
uc service exec myservice netstat -tulpn

Machine debugging

# Check machine status
uc machine ls

# View daemon logs
ssh USER@HOST sudo journalctl -u uncloud -f

# Check Corrosion logs
ssh USER@HOST sudo journalctl -u uncloud-corrosion -f

# Check Docker logs
ssh USER@HOST sudo journalctl -u docker -f

# View system resources
ssh USER@HOST free -h
ssh USER@HOST df -h
ssh -t USER@HOST top  # -t allocates the terminal that top needs

Caddy debugging

# View Caddy config
uc caddy config

# Check Caddy logs
uc service logs caddy
uc service logs caddy | grep -i error

# Check certificate files
uc service exec caddy ls -la /data/caddy/certificates/

# Test Caddy admin API
uc service exec caddy curl --unix-socket /run/caddy/admin.sock http://localhost/config/

Where to get help

GitHub Issues

Report bugs and request features on GitHub: github.com/psviderski/uncloud/issues

When opening an issue, include:

Uncloud version (uc version)
Machine OS and version
Complete error messages
Steps to reproduce
Relevant logs
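
Most of the items above can be gathered in one pass. This is a sketch of a report script, assuming `uc` is on your PATH and that the machine exposes /etc/os-release. `section` and `issue_report` are hypothetical helper names; error messages, reproduction steps, and logs still need to be added by hand.

```shell
#!/bin/sh
# section: print a titled block with the output of a command.
# A failing or missing command is noted instead of aborting the report.
section() {
  title="$1"
  shift
  printf '## %s\n' "$title"
  "$@" 2>&1 || printf '(command failed or unavailable: %s)\n' "$*"
  printf '\n'
}

# issue_report: collect version, OS, and cluster overview for a bug report.
issue_report() {
  section "Uncloud version" uc version
  section "Machine OS" cat /etc/os-release
  section "Machines" uc machine ls
  section "Services" uc service ls
}

# Usage:
#   issue_report > uncloud-report.txt
```

Review the file before posting it, since machine names and IP addresses may be sensitive.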

Discord Community

Join the Uncloud Discord for:

Quick questions
General discussions
Community support
Feature ideas

Discord: discord.gg/eR35KQJhPu

GitHub Discussions

For longer-form discussions, use GitHub Discussions: github.com/psviderski/uncloud/discussions

Good for:

How-to questions
Architecture discussions
Sharing setups and configurations
Feature proposals

Common error messages

Error: bind: address already in use

Cause: Another process is using the port

Solution:

# Find the process
sudo netstat -tulpn | grep :PORT

# Stop it
sudo kill PROCESS_ID

# Or use a different port
uc run -p 8080:80 myapp

Error: failed to connect to machine

Cause: SSH connection failed

Solutions:

Check SSH access manually:
```
ssh USER@HOST
```

Verify SSH key:

uc machine add USER@HOST --ssh-key ~/.ssh/id_rsa

Check firewall rules:
```
sudo ufw allow 22/tcp
```

Error: machine not found

Cause: Machine doesn’t exist in the cluster

Solution:

# List all machines
uc machine ls

# Add the machine
uc machine add USER@HOST --name machine1

Error: service not found

Cause: Service doesn’t exist or hasn’t propagated yet

Solutions:

List all services:
```
uc service ls
```
Wait for state propagation (10-30 seconds)
Check service name spelling

Error: no internet-reachable machines

Cause: No machine has a public IP, or the machines that do are blocked by a firewall

Solutions:

Set public IP on a machine:

uc machine update machine1 --public-ip 203.0.113.10

Open firewall ports (80, 443):

sudo ufw allow 80/tcp
sudo ufw allow 443/tcp

Configure port forwarding if behind NAT

Error: rate limit exceeded

Cause: You hit a Let’s Encrypt rate limit (for example, 50 certificates per registered domain per week)

Solutions:

Use staging environment:

{
  acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
}

Wait for quota reset (1 week)
Use DNS challenge instead of HTTP challenge (requires custom Caddyfile)

Next steps

Machine Management

Learn about machine operations

Monitoring

Set up monitoring and logging
