Overview

AlphaFold 3 requires multiple genetic and structural databases for generating multiple sequence alignments (MSAs) and structural templates. These databases enable the model to leverage evolutionary information.
Total size: ~252 GB compressed, ~630 GB uncompressed. Plan for sufficient storage and bandwidth.

Required Databases

AlphaFold 3 uses the following databases:

Protein Databases

BFD Small

Modified BFD (Big Fantastic Database)
Clustered protein sequences for fast MSA generation
Version: 2022-09-28

MGnify

Metagenomic sequences
Protein sequences from metagenomics studies
Version: 2022_05

UniProt

Universal Protein Resource
Comprehensive protein sequence database
Version: 2021_04

UniRef90

UniProt Reference Clusters
UniProt sequences clustered at 90% identity
Version: 2022_05

RNA Databases

NT-RNA

Nucleotide RNA
Clustered RNA sequences from NCBI
Version: 2023_02_23

RFam

RNA families
RNA sequence families database
Version: 14_9

RNACentral

RNA sequence database
Comprehensive RNA sequence collection
Version: 21_0

Structural Databases

PDB mmCIF

Protein Data Bank structures
~200,000 structures in mmCIF format
Version: 2022-09-28

PDB Seqres

PDB sequences
Sequence database for template search
Version: 2022-09-28

Quick Installation

Automated Download Script

AlphaFold 3 provides a download script that fetches all required databases:
cd alphafold3
./fetch_databases.sh [<DB_DIR>]
DB_DIR (path, default: "$HOME/public_databases"): Target directory for databases. Must NOT be inside the AlphaFold 3 repository.
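
Because DB_DIR must sit outside the repository, it can be worth checking the path before starting a long download. The following is a minimal sketch, not part of AlphaFold 3: `is_outside_repo` is a hypothetical helper name, and it assumes both directories already exist.

```shell
# Hypothetical helper (not shipped with AlphaFold 3).
# Succeeds if directory $1 is NOT located inside (or equal to) directory $2.
# Both directories must already exist; symlinks are resolved with pwd -P.
is_outside_repo() {
    _db_dir=$(cd "$1" && pwd -P) || return 2
    _repo_dir=$(cd "$2" && pwd -P) || return 2
    case "$_db_dir/" in
        "$_repo_dir"/*) return 1 ;;   # DB_DIR is inside the repository
        *) return 0 ;;
    esac
}

# Example usage (paths are placeholders):
#   is_outside_repo /data/alphafold_databases "$HOME/alphafold3" \
#       && ./fetch_databases.sh /data/alphafold_databases
```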

Prerequisites

sudo apt install wget zstd

Running in Screen/Tmux

The download takes about 45 minutes on a fast connection. Run it inside screen or tmux so that it keeps going if your SSH session drops.
# Start screen session
screen -S alphafold_dl

# Run download
cd alphafold3
./fetch_databases.sh /data/alphafold_databases

# Detach: Ctrl+A, then D
# Reattach later: screen -r alphafold_dl

Manual Installation

If you prefer manual download or need specific versions:

Protein Databases

wget https://storage.googleapis.com/alphafold-databases/v2.3.2/bfd-first_non_consensus_sequences.fasta.gz
gunzip bfd-first_non_consensus_sequences.fasta.gz
Size: ~17 GB compressed, ~65 GB uncompressed

RNA Databases

wget https://storage.googleapis.com/alphafold-databases/v2.3.2/nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta.gz
gunzip nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta.gz
Size: ~8 GB compressed, ~30 GB uncompressed

Structural Databases

# Download all mmCIF files
mkdir -p mmcif_files
rsync -rlpt -v -z --delete --port=33444 \
    rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ \
    mmcif_files/
Size: ~200,000 files, ~60 GB

Directory Structure

After installation, your database directory should look like:
/path/to/databases/
├── mmcif_files/
│   ├── 00/
│   ├── 01/
│   ├── 02/
│   ├── ...
│   └── zz/
├── bfd-first_non_consensus_sequences.fasta
├── mgy_clusters_2022_05.fa
├── nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta
├── pdb_seqres_2022_09_28.fasta
├── rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta
├── rnacentral_active_seq_id_90_cov_80_linclust.fasta
├── uniprot_all_2021_04.fa
└── uniref90_2022_05.fa
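
The layout above can be checked mechanically. This sketch prints every expected entry that is missing from a database directory; `missing_databases` is a hypothetical helper name, and the file names are copied from the tree above.

```shell
# Hypothetical helper: print each expected database entry missing from directory $1.
# Prints nothing when the directory matches the layout shown above.
missing_databases() {
    for _name in \
        mmcif_files \
        bfd-first_non_consensus_sequences.fasta \
        mgy_clusters_2022_05.fa \
        nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta \
        pdb_seqres_2022_09_28.fasta \
        rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta \
        rnacentral_active_seq_id_90_cov_80_linclust.fasta \
        uniprot_all_2021_04.fa \
        uniref90_2022_05.fa
    do
        [ -e "$1/$_name" ] || printf '%s\n' "$_name"
    done
}

# Example usage (path is a placeholder):
#   missing_databases /path/to/databases
```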

Storage Optimization

Using SSD for Performance

Genetic search is I/O intensive, so keeping the databases on an SSD can be 10-100× faster than on an HDD, depending on the workload.

Copying to SSD

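# GCP: Format and mount a local SSD
# WARNING: mkfs erases everything on the device; confirm the device name with lsblk first
sudo mkdir -p /mnt/disks/ssd
sudo mkfs.ext4 -F /dev/nvme0n1
sudo mount /dev/nvme0n1 /mnt/disks/ssd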

# Copy databases
sudo rsync -avh --progress /path/to/databases/ /mnt/disks/ssd/databases/

Using RAM Disk (Maximum Performance)

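# Create a 300 GB RAM disk (the machine needs more than 300 GB of physical RAM)
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=300G tmpfs /mnt/ramdisk

# Copy only the most-used databases; the full ~630 GB set does not fit in 300 GB
cp /path/to/databases/uniref90_2022_05.fa /mnt/ramdisk/
cp /path/to/databases/mgy_clusters_2022_05.fa /mnt/ramdisk/
RAM disk contents are lost on reboot. Use a RAM disk only for temporary, high-performance runs.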

Partial SSD Setup

Use SSD for frequently accessed databases, HDD for others:
# Copy frequently used databases to SSD
cp /hdd/databases/uniref90_2022_05.fa /ssd/databases/
cp /hdd/databases/mgy_clusters_2022_05.fa /ssd/databases/
cp /hdd/databases/bfd-first_non_consensus_sequences.fasta /ssd/databases/

# Run with multiple db_dir flags
python run_alphafold.py \
    --db_dir=/ssd/databases \
    --db_dir=/hdd/databases \
    ...
AlphaFold 3 looks for each database in the --db_dir directories in the order given, so it checks the SSD first and falls back to the HDD.
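
To preview which copy of a database would be picked up, the lookup order can be imitated in the shell. This is only an illustration of the "first directory wins" behaviour described above, not AlphaFold 3's own code; `first_db_dir_with` is a hypothetical helper name.

```shell
# Illustrative helper (not AlphaFold 3 code): print the first directory,
# in the order given, that contains the file named in $1.
# Returns non-zero if no directory contains it.
first_db_dir_with() {
    _file=$1
    shift
    for _dir in "$@"; do
        if [ -e "$_dir/$_file" ]; then
            printf '%s\n' "$_dir"
            return 0
        fi
    done
    return 1
}

# Example usage (paths are placeholders):
#   first_db_dir_with uniref90_2022_05.fa /ssd/databases /hdd/databases
```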

Database Sharding

Sharding is aimed at high-throughput environments with many CPU cores.

Why Shard?

Sharding enables parallel genetic search across many CPU cores, dramatically reducing wall-clock time.
Benefits:
  • Utilize 32+ core systems effectively
  • Reduce genetic search time by 10-50×
  • Maximize disk I/O parallelization

Sharding Process

1. Install seqkit

# From conda
conda install -c bioconda seqkit

# Or download binary
wget https://github.com/shenwei356/seqkit/releases/download/v2.5.1/seqkit_linux_amd64.tar.gz
tar -xzf seqkit_linux_amd64.tar.gz
sudo mv seqkit /usr/local/bin/
2. Shuffle Sequences

seqkit shuffle --two-pass uniref90_2022_05.fa > uniref90_shuffled.fa
Random shuffling ensures balanced shard sizes.
3. Split into Shards

# Split into 128 shards
seqkit split2 --by-part 128 uniref90_shuffled.fa
Output: uniref90_shuffled.fa.split/uniref90_shuffled.part_001.fa, etc.
4. Rename with Padding

cd uniref90_shuffled.fa.split
for i in {1..128}; do
    padded=$(printf "%05d" $((i-1)))
    mv uniref90_shuffled.part_$(printf "%03d" $i).fa \
       ../uniref90.fasta-${padded}-of-00128
done
5. Count Sequences/Bases

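# Run these from the directory that contains the renamed shards.
# For proteins: count sequences (column 4 of the tab-separated output is num_seqs)
seqkit stats -T uniref90.fasta-* | awk -F'\t' 'NR>1 {sum+=$4} END {print sum}'

# For RNA: count bases (column 5 is sum_len)
seqkit stats -T ntrna.fasta-* | awk -F'\t' 'NR>1 {sum+=$5} END {print sum}'
Save these values for the Z-value flags.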
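
As a cross-check on the seqkit totals, the same two numbers can be computed with plain awk. This is an alternative sketch, not the method used above: it assumes uncompressed FASTA files with Unix line endings, and the function names are hypothetical.

```shell
# Count FASTA records (lines starting with ">") across all given files.
count_fasta_records() {
    awk '/^>/ {n++} END {print n+0}' "$@"
}

# Count residues or bases: the total length of all non-header lines.
count_fasta_bases() {
    awk '!/^>/ {n += length($0)} END {print n+0}' "$@"
}

# Example usage:
#   count_fasta_records uniref90.fasta-*
#   count_fasta_bases ntrna.fasta-*
```

Counting over all shards should give the same totals as counting over the original unsharded file, which makes a quick check that no sequences were lost during splitting.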

Using Sharded Databases

python run_alphafold.py \
    --uniref90_database_path="uniref90.fasta@128" \
    --uniref90_z_value=153742194 \
    --jackhmmer_n_cpu=2 \
    --jackhmmer_max_parallel_shards=16 \
    ...
@128 (shard_count): Specifies 128 shards with pattern uniref90.fasta-XXXXX-of-00128
uniref90_z_value (integer): Total sequence count across all shards (for e-value scaling)
jackhmmer_max_parallel_shards (integer): Maximum shards to process in parallel
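
The sketch below expands a `prefix@N` value into the file names it implies, following the XXXXX-of-NNNNN pattern described above (zero-based, five digits). `expand_shard_spec` is a hypothetical helper for checking a shard set before a run, not an AlphaFold 3 command.

```shell
# Hypothetical helper: expand "prefix@N" into the N shard file names it implies,
# e.g. uniref90.fasta@128 -> uniref90.fasta-00000-of-00128 ... -00127-of-00128.
expand_shard_spec() {
    _prefix=${1%@*}     # text before the last "@"
    _count=${1##*@}     # text after the last "@"
    _i=0
    while [ "$_i" -lt "$_count" ]; do
        printf '%s-%05d-of-%05d\n' "$_prefix" "$_i" "$_count"
        _i=$((_i + 1))
    done
}

# Example usage: report shards that are missing from the current directory.
#   expand_shard_spec "uniref90.fasta@128" | while read -r f; do
#       [ -e "$f" ] || echo "missing: $f"
#   done
```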
Database      Unsharded Size   Recommended Shards
UniRef90      140 GB           128-256
MGnify        120 GB           256-512
BFD Small     65 GB            64-128
UniProt       100 GB           128-256
NT-RNA        30 GB            64-256
RNACentral    8 GB             16-64
RFam          50 MB            8-16
For consistent performance, aim for equal shard sizes (~0.5-2 GB per shard).
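
A shard count for a given target shard size is a ceiling division. The sketch below shows the arithmetic; `shard_count_for` is a hypothetical helper, and the choice of target size is left to you.

```shell
# Hypothetical helper: number of shards needed so that no shard exceeds the target size.
# $1 = database size in MB, $2 = target shard size in MB (both positive integers)
shard_count_for() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Example usage: a 140 GB database (143360 MB) with 1 GB (1024 MB) shards
#   shard_count_for 143360 1024
```

For the 140 GB example this yields 140 shards, which falls inside the 128-256 range recommended for UniRef90.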

Permissions and Access

Setting Permissions

Incorrect permissions cause opaque errors from the MSA tools. Make sure the user running AlphaFold 3 can read every database file and traverse every directory.
# Set directory permissions
sudo chmod 755 --recursive /path/to/databases

# If running Docker, ensure container can read
sudo chown -R $(id -u):$(id -g) /path/to/databases

Docker Mounts

docker run -it \
    --volume /path/to/databases:/root/public_databases:ro \
    ...
The :ro suffix mounts the databases read-only, so the container cannot modify them.

Singularity Binds

singularity exec \
    --bind /path/to/databases:/root/public_databases \
    ...

Verifying Installation

Check Files Exist

ls -lh /path/to/databases/*.fa*
ls -lh /path/to/databases/mmcif_files/ | head

Test with AlphaFold

python run_alphafold.py \
    --json_path=test_input.json \
    --db_dir=/path/to/databases \
    --model_dir=/path/to/models \
    --output_dir=/path/to/output
A successful run confirms that the databases are set up correctly.

Database Updates

AlphaFold 3 uses specific database versions from the paper. Newer versions may work but are not officially supported.
Using different database versions may affect prediction quality and reproducibility.
If you must update:
  1. Download new version to separate directory
  2. Test with known inputs
  3. Compare results to original databases
  4. Update --db_dir flags

Troubleshooting

Download Interrupted

# Resume wget download
wget -c <url>

# Resume rsync
rsync -avh --progress --partial /source /destination

Corrupted Files

# Check file integrity
md5sum uniref90_2022_05.fa.gz
# Compare with published checksum

# Re-download if needed
rm corrupted_file.gz
wget <url>
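
The comparison against a published checksum can be scripted so that a mismatch produces a non-zero exit status. This is a minimal sketch assuming GNU `md5sum` is installed; `verify_md5` is a hypothetical helper, and the expected value must come from wherever the checksum is published.

```shell
# Hypothetical helper: succeed only if the MD5 digest of file $1 equals $2.
verify_md5() {
    _actual=$(md5sum "$1" | awk '{print $1}') || return 2
    [ "$_actual" = "$2" ]
}

# Example usage (file name and checksum are placeholders):
#   verify_md5 uniref90_2022_05.fa.gz "<published checksum>" || echo "checksum mismatch"
```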

Insufficient Space

# Check available space
df -h /path/to/databases

# Clean up compressed files after extraction
rm *.gz
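
Checking free space before downloading or extracting avoids a failure partway through. The sketch below wraps `df` in a yes/no test; `has_free_space` is a hypothetical helper, and the 630 GB figure in the usage line is the uncompressed total quoted in the overview.

```shell
# Hypothetical helper: succeed if the filesystem holding directory $1
# has at least $2 KiB available (df -P gives a stable, one-line-per-filesystem format).
has_free_space() {
    _avail_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
    [ "$_avail_kb" -ge "$2" ]
}

# Example usage: require ~630 GB, expressed in KiB
#   has_free_space /path/to/databases $((630 * 1024 * 1024)) || echo "not enough space"
```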

Permission Errors

# Fix ownership (use numeric IDs; the group name is not always the same as $USER)
sudo chown -R "$(id -u):$(id -g)" /path/to/databases

# Fix permissions
chmod -R 755 /path/to/databases

MSA Tools Can’t Find Databases

# Check paths are absolute
python run_alphafold.py \
    --db_dir=/absolute/path/to/databases \
    ...

# Verify files are readable
head -n 4 /path/to/databases/uniref90_2022_05.fa

Database Licenses

All databases are available under permissive licenses:
  • BFD: CC BY 4.0
  • MGnify: CC0 1.0
  • PDB: CC0 1.0
  • UniProt/UniRef: CC BY 4.0
  • NT-RNA: Modified (see paper)
  • RFam: CC0 1.0
  • RNACentral: CC0 1.0
See AlphaFold 3 README for full attribution.

Next Steps

Performance

Optimize database access and search speed

Running Docker

Use databases with AlphaFold 3