Database Setup - AlphaFold 3

Overview

AlphaFold 3 requires multiple genetic and structural databases for generating multiple sequence alignments (MSAs) and structural templates. These databases enable the model to leverage evolutionary information.

Total size: ~252 GB compressed, ~630 GB uncompressed. Plan for sufficient storage and bandwidth.
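Before starting a download of this size, it can help to confirm that the target filesystem has room. The following is a minimal sketch, not part of AlphaFold 3: the function name has_free_space is hypothetical, and the 630 GB figure in the usage note is simply the uncompressed total quoted above.

```shell
# Sketch (hypothetical helper): succeed only if the filesystem that holds
# DIR has at least REQUIRED_GB gigabytes available. DIR must already exist.
has_free_space() {
    local dir="$1" required_gb="$2" avail_kb
    # -P forces POSIX single-line records; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 {print $4}')
    [ -n "$avail_kb" ] || return 2
    # Compare in KiB: 1 GiB = 1024 * 1024 KiB.
    [ "$avail_kb" -ge $(( required_gb * 1024 * 1024 )) ]
}
```

For example, `has_free_space /data 630 || echo "not enough space for the uncompressed databases"`. Allow extra headroom if the compressed archives are kept next to the extracted files.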

Required Databases

AlphaFold 3 uses the following databases:

Protein Databases

BFD Small

Modified BFD (Big Fantastic Database)
Clustered protein sequences for fast MSA generation
Version: 2022-09-28

MGnify

Metagenomic sequences
Protein sequences from metagenomics studies
Version: 2022_05

UniProt

Universal Protein Resource
Comprehensive protein sequence database
Version: 2021_04

UniRef90

UniProt Reference Clusters
UniProt sequences clustered at 90% identity
Version: 2022_05

RNA Databases

NT-RNA

Nucleotide RNA
Clustered RNA sequences from NCBI
Version: 2023_02_23

RFam

RNA families
RNA sequence families database
Version: 14_9

RNACentral

RNA sequence database
Comprehensive RNA sequence collection
Version: 21_0

Structural Databases

PDB mmCIF

Protein Data Bank structures
~200,000 structures in mmCIF format
Version: 2022-09-28

PDB Seqres

PDB sequences
Sequence database for template search
Version: 2022-09-28

Quick Installation

Automated Download Script

AlphaFold 3 provides a download script that fetches all required databases:

cd alphafold3
./fetch_databases.sh [<DB_DIR>]

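DB_DIR (path, default: "$HOME/public_databases")

Target directory for the databases. It must NOT be inside the AlphaFold 3 repository; otherwise the Docker build copies the databases into the image and becomes very slow.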
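The constraint on DB_DIR can be checked before the download starts. The sketch below is an illustration only: is_inside_repo is a hypothetical helper, not something the fetch script provides, and it assumes GNU coreutils for `realpath -m`.

```shell
# Sketch (hypothetical helper): return 0 if DB_DIR resolves to REPO_DIR
# itself or to any path beneath it, and 1 otherwise.
is_inside_repo() {
    local db_dir repo_dir
    # realpath -m (GNU coreutils) normalizes paths that may not exist yet.
    db_dir=$(realpath -m "$1")
    repo_dir=$(realpath -m "$2")
    # Appending "/" makes the repository root itself match and prevents
    # sibling names such as alphafold3_dbs from matching alphafold3.
    case "$db_dir/" in
        "$repo_dir"/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A possible use is `is_inside_repo /data/alphafold_databases "$PWD" && echo "choose a directory outside the repository"`, run from the repository root.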

Prerequisites

sudo apt install wget zstd

Running in Screen/Tmux

Download takes ~45 minutes on fast connections. Use screen or tmux for long-running processes.

# Start screen session
screen -S alphafold_dl

# Run download
cd alphafold3
./fetch_databases.sh /data/alphafold_databases

# Detach: Ctrl+A, then D
# Reattach later: screen -r alphafold_dl

Manual Installation

If you prefer manual download or need specific versions:

Protein Databases

BFD Small

wget https://storage.googleapis.com/alphafold-databases/v3.0/bfd-first_non_consensus_sequences.fasta.zst
zstd --decompress bfd-first_non_consensus_sequences.fasta.zst

Size: ~17 GB compressed, ~65 GB uncompressed

MGnify

wget https://storage.googleapis.com/alphafold-databases/v3.0/mgy_clusters_2022_05.fa.zst
zstd --decompress mgy_clusters_2022_05.fa.zst

Size: ~64 GB compressed, ~120 GB uncompressed

UniProt

wget https://storage.googleapis.com/alphafold-databases/v3.0/uniprot_all_2021_04.fa.zst
zstd --decompress uniprot_all_2021_04.fa.zst

Size: ~50 GB compressed, ~100 GB uncompressed

UniRef90

wget https://storage.googleapis.com/alphafold-databases/v3.0/uniref90_2022_05.fa.zst
zstd --decompress uniref90_2022_05.fa.zst

Size: ~55 GB compressed, ~140 GB uncompressed

RNA Databases

NT-RNA

wget https://storage.googleapis.com/alphafold-databases/v3.0/nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta.zst
zstd --decompress nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta.zst

Size: ~8 GB compressed, ~30 GB uncompressed

RFam

wget https://storage.googleapis.com/alphafold-databases/v3.0/rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta.zst
zstd --decompress rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta.zst

Size: ~10 MB compressed, ~50 MB uncompressed

RNACentral

wget https://storage.googleapis.com/alphafold-databases/v3.0/rnacentral_active_seq_id_90_cov_80_linclust.fasta.zst
zstd --decompress rnacentral_active_seq_id_90_cov_80_linclust.fasta.zst

Size: ~2 GB compressed, ~8 GB uncompressed

Structural Databases

PDB mmCIF

# Download all mmCIF files from the RCSB rsync mirror.
# Note: this mirrors the current PDB, not the 2022-09-28 snapshot listed above.
mkdir -p mmcif_files
rsync -rlpt -v -z --delete --port=33444 \
    rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ \
    mmcif_files/

Size: ~200,000 files, ~60 GB

PDB Seqres

wget https://storage.googleapis.com/alphafold-databases/v3.0/pdb_seqres_2022_09_28.fasta.zst
zstd --decompress pdb_seqres_2022_09_28.fasta.zst

Size: ~12 MB compressed, ~60 MB uncompressed

Directory Structure

After installation, your database directory should look like:

/path/to/databases/
├── mmcif_files/
│   ├── 00/
│   ├── 01/
│   ├── 02/
│   ├── ...
│   └── zz/
├── bfd-first_non_consensus_sequences.fasta
├── mgy_clusters_2022_05.fa
├── nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta
├── pdb_seqres_2022_09_28.fasta
├── rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta
├── rnacentral_active_seq_id_90_cov_80_linclust.fasta
├── uniprot_all_2021_04.fa
└── uniref90_2022_05.fa

Storage Optimization

Using SSD for Performance

Genetic search is I/O intensive, so keeping the databases on SSD can speed it up substantially (on the order of 10-100× over HDD, depending on hardware and caching).

Copying to SSD

# GCP: format and mount a local SSD
# WARNING: mkfs erases the device; confirm the device name with lsblk first
sudo mkdir -p /mnt/disks/ssd
sudo mkfs.ext4 -F /dev/nvme0n1
sudo mount /dev/nvme0n1 /mnt/disks/ssd

# Copy databases
sudo rsync -avh --progress /path/to/databases/ /mnt/disks/ssd/databases/

Using RAM Disk (Maximum Performance)

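# Create a 300 GB RAM disk (the machine needs more physical RAM than this)
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=300G tmpfs /mnt/ramdisk

# Copy only the most-used databases; the full ~630 GB set does not fit
cp /path/to/databases/uniref90_2022_05.fa /mnt/ramdisk/
cp /path/to/databases/mgy_clusters_2022_05.fa /mnt/ramdisk/

RAM disk contents are lost on reboot or unmount, so use this only for temporary high-performance runs.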

Partial SSD Setup

Use SSD for frequently accessed databases, HDD for others:

# Copy frequently used databases to SSD
cp /hdd/databases/uniref90_2022_05.fa /ssd/databases/
cp /hdd/databases/mgy_clusters_2022_05.fa /ssd/databases/
cp /hdd/databases/bfd-first_non_consensus_sequences.fasta /ssd/databases/

# Run with multiple db_dir flags
python run_alphafold.py \
    --db_dir=/ssd/databases \
    --db_dir=/hdd/databases \
    ...

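AlphaFold 3 searches the --db_dir directories in the order given, so a database present on the SSD is used from there and anything missing falls back to the HDD copy.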
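The lookup order can be previewed from the shell before a run. This is a sketch of the first-match behavior described here, not AlphaFold 3's own code; find_in_db_dirs is a hypothetical helper.

```shell
# Sketch (hypothetical helper): print the first existing copy of FILE among
# the given directories, searched in the order listed. Returns 1 if no
# directory contains it.
find_in_db_dirs() {
    local file="$1" dir
    shift
    for dir in "$@"; do
        if [ -e "$dir/$file" ]; then
            printf '%s\n' "$dir/$file"
            return 0
        fi
    done
    return 1
}
```

For example, `find_in_db_dirs uniprot_all_2021_04.fa /ssd/databases /hdd/databases` shows which copy would be picked up with the flags above.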

Database Sharding

For high-throughput environments with many CPU cores:

Why Shard?

Sharding enables parallel genetic search across many CPU cores, dramatically reducing wall-clock time.

Benefits:

Utilize 32+ core systems effectively
Reduce genetic search time by 10-50×
Maximize disk I/O parallelization

Sharding Process

Install seqkit

# From conda
conda install -c bioconda seqkit

# Or download binary
wget https://github.com/shenwei356/seqkit/releases/download/v2.5.1/seqkit_linux_amd64.tar.gz
tar -xzf seqkit_linux_amd64.tar.gz
sudo mv seqkit /usr/local/bin/

Shuffle Sequences

seqkit shuffle --two-pass uniref90_2022_05.fa > uniref90_shuffled.fa

Random shuffling ensures balanced shard sizes.

Split into Shards

# Split into 128 shards
seqkit split2 --by-part 128 uniref90_shuffled.fa

Output: uniref90_shuffled.fa.split/uniref90_shuffled.part_001.fa, etc.

Rename with Padding

cd uniref90_shuffled.fa.split
for i in {1..128}; do
    padded=$(printf "%05d" $((i-1)))
    mv uniref90_shuffled.part_$(printf "%03d" $i).fa \
       ../uniref90.fasta-${padded}-of-00128
done
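The target name used in the loop above can be factored into a small function, which makes the zero-based, five-digit padding easy to check in isolation. The name shard_name is hypothetical; the pattern is the one shown in the rename step.

```shell
# Sketch (hypothetical helper): build a shard file name of the form
# PREFIX-XXXXX-of-YYYYY, where XXXXX is the zero-based shard index and
# YYYYY is the total shard count, both zero-padded to five digits.
# Pass plain decimal integers (no leading zeros) for INDEX and TOTAL.
shard_name() {
    local prefix="$1" index="$2" total="$3"
    printf '%s-%05d-of-%05d\n' "$prefix" "$index" "$total"
}
```

With 128 shards, `shard_name uniref90.fasta 0 128` gives the first name and `shard_name uniref90.fasta 127 128` the last.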

Count Sequences/Bases

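# Run from the directory that holds the renamed shards (the parent of *.split)
# For proteins: total number of sequences (column 4, num_seqs)
seqkit stats -T uniref90.fasta-* | awk 'NR > 1 {sum += $4} END {print sum}'

# For RNA: total number of bases (column 5, sum_len)
seqkit stats -T ntrna.fasta-* | awk 'NR > 1 {sum += $5} END {print sum}'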

Save these values for Z-value flags.
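As a cross-check on the seqkit totals, the same two numbers can be derived with standard tools. This is a sketch assuming plain FASTA input (header lines start with ">", everything else is sequence); the function names are hypothetical, and on the full databases it will be slow.

```shell
# Sketch (hypothetical helpers) for cross-checking Z-values without seqkit.

# Number of FASTA records: one per header line starting with '>'.
count_sequences() {
    cat "$@" | grep -c '^>'
}

# Number of residues or bases: total length of all non-header lines,
# ignoring spaces, tabs, and carriage returns.
count_bases() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); total += length($0) }
         END   { print total + 0 }' "$@"
}
```

For example, `count_sequences uniref90.fasta-*` should agree with the protein count above, and `count_bases ntrna.fasta-*` with the RNA base count.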

Using Sharded Databases

python run_alphafold.py \
    --uniref90_database_path="uniref90.fasta@128" \
    --uniref90_z_value=153742194 \
    --jackhmmer_n_cpu=2 \
    --jackhmmer_max_parallel_shards=16 \
    ...

@128 (shard count): Specifies 128 shards following the pattern uniref90.fasta-XXXXX-of-00128

uniref90_z_value (integer): Total sequence count across all shards (for e-value scaling)

jackhmmer_max_parallel_shards (integer): Maximum number of shards to process in parallel
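Because the @N suffix implies a complete set of N files, a missing shard is worth catching before a run. The following sketch checks for every expected file name; check_shards is a hypothetical helper, and the naming pattern is the one described above.

```shell
# Sketch (hypothetical helper): verify that every shard PREFIX-XXXXX-of-TOTAL
# exists and is readable. Missing shards are reported on stderr and make the
# function return nonzero.
check_shards() {
    local prefix="$1" total="$2" i=0 missing=0 path
    while [ "$i" -lt "$total" ]; do
        path=$(printf '%s-%05d-of-%05d' "$prefix" "$i" "$total")
        if [ ! -r "$path" ]; then
            echo "missing shard: $path" >&2
            missing=$(( missing + 1 ))
        fi
        i=$(( i + 1 ))
    done
    [ "$missing" -eq 0 ]
}
```

For the example flags, `check_shards /path/to/databases/uniref90.fasta 128` would confirm the set is complete.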

Recommended Shard Counts

Database	Unsharded Size	Recommended Shards
UniRef90	140 GB	128-256
MGnify	120 GB	256-512
BFD Small	65 GB	64-128
UniProt	100 GB	128-256
NT-RNA	30 GB	64-256
RNACentral	8 GB	16-64
RFam	50 MB	8-16

For consistent performance, aim for equal shard sizes (~0.5-2 GB per shard).
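The relationship between database size, target shard size, and shard count is a ceiling division. The sketch below works in megabytes to stay in integer arithmetic; shards_for is a hypothetical helper, and the result is only a starting point to compare against the table.

```shell
# Sketch (hypothetical helper): smallest shard count that keeps each shard
# at or below TARGET_MB for a database of SIZE_MB (ceiling division).
shards_for() {
    local size_mb="$1" target_mb="$2"
    echo $(( (size_mb + target_mb - 1) / target_mb ))
}
```

For instance, a 140 GB database (143360 MB) with a 1 GB target (1024 MB) gives `shards_for 143360 1024`, which evaluates to 140 and falls inside the 128-256 range listed for UniRef90.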

Permissions and Access

Setting Permissions

Improper permissions cause opaque MSA tool errors. Make sure the user running AlphaFold 3 can read every database file and traverse every directory.

# Set directory permissions
sudo chmod 755 --recursive /path/to/databases

# If running Docker, ensure container can read
sudo chown -R $(id -u):$(id -g) /path/to/databases

Docker Mounts

docker run -it \
    --volume /path/to/databases:/root/public_databases:ro \
    ...

Use :ro (read-only) suffix for safety.

Singularity Binds

singularity exec \
    --bind /path/to/databases:/root/public_databases \
    ...

Verifying Installation

Check Files Exist

ls -lh /path/to/databases/*.fa*
ls -lh /path/to/databases/mmcif_files/ | head
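A listing shows what is present but not what is absent. The sketch below checks for each file named in the Directory Structure section; verify_db_dir is a hypothetical helper, and it only tests existence and non-emptiness, not content.

```shell
# Sketch (hypothetical helper): confirm that DB_DIR contains every expected
# database file (non-empty) plus the mmcif_files directory. Problems are
# reported on stderr and make the function return nonzero.
verify_db_dir() {
    local db_dir="$1" name missing=0
    for name in \
        bfd-first_non_consensus_sequences.fasta \
        mgy_clusters_2022_05.fa \
        nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta \
        pdb_seqres_2022_09_28.fasta \
        rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta \
        rnacentral_active_seq_id_90_cov_80_linclust.fasta \
        uniprot_all_2021_04.fa \
        uniref90_2022_05.fa
    do
        if [ ! -s "$db_dir/$name" ]; then
            echo "missing or empty: $db_dir/$name" >&2
            missing=$(( missing + 1 ))
        fi
    done
    if [ ! -d "$db_dir/mmcif_files" ]; then
        echo "missing directory: $db_dir/mmcif_files" >&2
        missing=$(( missing + 1 ))
    fi
    [ "$missing" -eq 0 ]
}
```

Usage: `verify_db_dir /path/to/databases && echo "all expected files present"`.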

Test with AlphaFold

python run_alphafold.py \
    --json_path=test_input.json \
    --db_dir=/path/to/databases \
    --model_dir=/path/to/models \
    --output_dir=/path/to/output

Successful run confirms database setup.

Database Updates

AlphaFold 3 uses specific database versions from the paper. Newer versions may work but are not officially supported.

Using different database versions may affect prediction quality and reproducibility.

If you must update:

1. Download the new version to a separate directory
2. Test with known inputs
3. Compare results against those from the original databases
4. Update the --db_dir flags

Troubleshooting

Download Interrupted

# Resume wget download
wget -c <url>

# Resume rsync
rsync -avh --progress --partial /source /destination

Corrupted Files

# Check file integrity
md5sum <downloaded_archive>
# Compare with a published checksum, if one is available

# Re-download if needed
rm <downloaded_archive>
wget <url>
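When no reference checksum is at hand, the compression tools can still test an archive's internal consistency, which catches truncated or damaged downloads. This is a sketch; test_archive is a hypothetical helper that picks the tool from the file extension and relies only on the standard `gzip -t` and `zstd -t` test modes.

```shell
# Sketch (hypothetical helper): test a compressed archive without extracting
# it. Returns 0 if the archive is intact, nonzero if it is damaged, and 2 if
# the extension is not recognized.
test_archive() {
    case "$1" in
        *.gz)  gzip -t "$1" ;;
        *.zst) zstd -t -q "$1" ;;
        *)     echo "unknown archive type: $1" >&2; return 2 ;;
    esac
}
```

For example, `test_archive <downloaded_archive> || echo "archive is damaged; download it again"`. A passing test shows the file is structurally intact, not that it matches the published release.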

Insufficient Space

# Check available space
df -h /path/to/databases

# Remove compressed archives once extraction has succeeded
rm -f /path/to/databases/*.gz /path/to/databases/*.zst

Permission Errors

# Fix ownership
sudo chown -R $USER:$USER /path/to/databases

# Fix permissions
chmod -R 755 /path/to/databases

MSA Tools Can’t Find Databases

# Check paths are absolute
python run_alphafold.py \
    --db_dir=/absolute/path/to/databases \
    ...

# Verify files are readable
head /path/to/databases/uniref90_2022_05.fa

Database Licenses

All databases are available under permissive licenses:

BFD: CC BY 4.0
MGnify: CC0 1.0
PDB: CC0 1.0
UniProt/UniRef: CC BY 4.0
NT-RNA: Modified (see paper)
RFam: CC0 1.0
RNACentral: CC0 1.0

See AlphaFold 3 README for full attribution.

Next Steps

Performance: Optimize database access and search speed

Running Docker: Use databases with AlphaFold 3
