Overview
AlphaFold 3 requires multiple genetic and structural databases for generating multiple sequence alignments (MSAs) and structural templates. These databases enable the model to leverage evolutionary information.
Total size: ~252 GB compressed, ~630 GB uncompressed. Plan for sufficient storage and bandwidth.
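A quick free-space check before downloading can save a failed run. The following is a minimal sketch, assuming a POSIX shell and `df`; `has_enough_space` is a hypothetical helper name, and the 630 GB figure in the usage note is simply the uncompressed total quoted above.

```shell
# Sketch only: has_enough_space is a hypothetical helper, not part of AlphaFold 3.
# Usage: has_enough_space <directory> <required_gb>
# Succeeds (exit 0) when the filesystem holding <directory> has at least
# <required_gb> GB available.
has_enough_space() {
  dir="$1"
  required_gb="$2"
  # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
  # the fourth column of the data line is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $(( required_gb * 1024 * 1024 )) ]
}
```

For example, `has_enough_space /data 630 || echo "not enough space"`. Compressed archives and extracted files may coexist on disk during extraction, so some headroom above the uncompressed total is prudent.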
Required Databases
AlphaFold 3 uses the following databases:
Protein Databases
BFD Small: Modified BFD (Big Fantastic Database). Clustered protein sequences for fast MSA generation. Version: 2022-09-28
MGnify: Metagenomic sequences. Protein sequences from metagenomics studies. Version: 2022_05
UniProt: Universal Protein Resource. Comprehensive protein sequence database. Version: 2021_04
UniRef90: UniProt Reference Clusters. UniProt sequences clustered at 90% identity. Version: 2022_05
RNA Databases
NT-RNA: Nucleotide RNA. Clustered RNA sequences from NCBI. Version: 2023_02_23
RFam: RNA families. RNA sequence families database. Version: 14_9
RNACentral: RNA sequence database. Comprehensive RNA sequence collection. Version: 21_0
Structural Databases
PDB mmCIF: Protein Data Bank structures. ~200,000 structures in mmCIF format. Version: 2022-09-28
PDB Seqres: PDB sequences. Sequence database for template search. Version: 2022-09-28
Quick Installation
Automated Download Script
AlphaFold 3 provides a download script that fetches all required databases:
cd alphafold3
./fetch_databases.sh [<DB_DIR>]
DB_DIR (path, default: "$HOME/public_databases"): Target directory for the databases. Must NOT be inside the AlphaFold 3 repository.
Prerequisites
sudo apt install wget zstd
Running in Screen/Tmux
The download takes ~45 minutes on a fast connection. Run it inside screen or tmux so that it keeps going if your terminal session disconnects.
# Start screen session
screen -S alphafold_dl
# Run download
cd alphafold3
./fetch_databases.sh /data/alphafold_databases
# Detach: Ctrl+A, then D
# Reattach later: screen -r alphafold_dl
Manual Installation
If you prefer manual download or need specific versions:
Protein Databases
BFD Small
wget https://storage.googleapis.com/alphafold-databases/v2.3.2/bfd-first_non_consensus_sequences.fasta.gz
gunzip bfd-first_non_consensus_sequences.fasta.gz
Size: ~17 GB compressed, ~65 GB uncompressed
MGnify
wget https://storage.googleapis.com/alphafold-databases/v2.3.2/mgy_clusters_2022_05.fa.gz
gunzip mgy_clusters_2022_05.fa.gz
Size: ~64 GB compressed, ~120 GB uncompressed
UniProt
wget https://storage.googleapis.com/alphafold-databases/v2.3.2/uniprot_all_2021_04.fa.gz
gunzip uniprot_all_2021_04.fa.gz
Size: ~50 GB compressed, ~100 GB uncompressed
UniRef90
wget https://storage.googleapis.com/alphafold-databases/v2.3.2/uniref90_2022_05.fa.gz
gunzip uniref90_2022_05.fa.gz
Size: ~55 GB compressed, ~140 GB uncompressed
RNA Databases
NT-RNA
wget https://storage.googleapis.com/alphafold-databases/v2.3.2/nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta.gz
gunzip nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta.gz
Size: ~8 GB compressed, ~30 GB uncompressed
RFam
wget https://storage.googleapis.com/alphafold-databases/v2.3.2/rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta.gz
gunzip rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta.gz
Size: ~10 MB compressed, ~50 MB uncompressed
RNACentral
wget https://storage.googleapis.com/alphafold-databases/v2.3.2/rnacentral_active_seq_id_90_cov_80_linclust.fasta.gz
gunzip rnacentral_active_seq_id_90_cov_80_linclust.fasta.gz
Size: ~2 GB compressed, ~8 GB uncompressed
Structural Databases
PDB mmCIF
# Download all mmCIF files
mkdir -p mmcif_files
rsync -rlpt -v -z --delete --port=33444 \
  rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ \
  mmcif_files/
Size: ~200,000 files, ~60 GB
PDB Seqres
wget https://storage.googleapis.com/alphafold-databases/v2.3.2/pdb_seqres_2022_09_28.fasta.gz
gunzip pdb_seqres_2022_09_28.fasta.gz
Size: ~12 MB compressed, ~60 MB uncompressed
Directory Structure
After installation, your database directory should look like:
/path/to/databases/
├── mmcif_files/
│ ├── 00/
│ ├── 01/
│ ├── 02/
│ ├── ...
│ └── zz/
├── bfd-first_non_consensus_sequences.fasta
├── mgy_clusters_2022_05.fa
├── nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta
├── pdb_seqres_2022_09_28.fasta
├── rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta
├── rnacentral_active_seq_id_90_cov_80_linclust.fasta
├── uniprot_all_2021_04.fa
└── uniref90_2022_05.fa
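The layout above can be checked mechanically. This is a minimal sketch, assuming the file names shown in the tree; `check_db_dir` is a hypothetical helper and not part of AlphaFold 3. It only confirms that each file exists and is non-empty and that the mmcif_files directory is present; it does not validate contents.

```shell
# Sketch only: check_db_dir is a hypothetical helper.
# Usage: check_db_dir <database directory>
# Prints one line per missing item and returns 1 if anything is missing.
check_db_dir() {
  db_dir="$1"
  missing=0
  for f in \
    bfd-first_non_consensus_sequences.fasta \
    mgy_clusters_2022_05.fa \
    nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta \
    pdb_seqres_2022_09_28.fasta \
    rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta \
    rnacentral_active_seq_id_90_cov_80_linclust.fasta \
    uniprot_all_2021_04.fa \
    uniref90_2022_05.fa
  do
    # -s is true only for files that exist and have a size greater than zero.
    if [ ! -s "$db_dir/$f" ]; then
      echo "missing or empty: $f"
      missing=1
    fi
  done
  if [ ! -d "$db_dir/mmcif_files" ]; then
    echo "missing directory: mmcif_files"
    missing=1
  fi
  return "$missing"
}
```

Run it as `check_db_dir /path/to/databases`; no output and a zero exit status mean every expected entry was found.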
Storage Optimization
Genetic search is I/O intensive. SSD storage provides 10-100× speedup over HDD.
Copying to SSD
# GCP: Mount and format SSD
sudo mkdir /mnt/disks/ssd
sudo mkfs.ext4 -F /dev/nvme0n1
sudo mount /dev/nvme0n1 /mnt/disks/ssd
# Copy databases
sudo rsync -avh --progress /path/to/databases/ /mnt/disks/ssd/databases/
# Create 300GB RAM disk
sudo mkdir /mnt/ramdisk
sudo mount -t tmpfs -o size=300G tmpfs /mnt/ramdisk
# Copy the databases to keep in RAM (the selection must fit within the 300 GB RAM disk)
cp -r /path/to/databases/* /mnt/ramdisk/
RAM disk contents are lost on reboot. Only for temporary high-performance scenarios.
Partial SSD Setup
Use SSD for frequently accessed databases, HDD for others:
# Copy frequently used databases to SSD
cp /hdd/databases/uniref90_2022_05.fa /ssd/databases/
cp /hdd/databases/mgy_clusters_2022_05.fa /ssd/databases/
cp /hdd/databases/bfd-first_non_consensus_sequences.fasta /ssd/databases/
# Run with multiple db_dir flags
python run_alphafold.py \
--db_dir=/ssd/databases \
--db_dir=/hdd/databases \
...
AlphaFold 3 looks in the --db_dir directories in the order given, so it checks the SSD first and falls back to the HDD.
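To preview which copy of a file would be picked up under this first-match behaviour, the lookup can be imitated in the shell. This is an illustrative sketch of the behaviour described above, not AlphaFold 3's own code; `resolve_db_file` is a hypothetical helper.

```shell
# Sketch only: resolve_db_file is a hypothetical helper that mimics a
# first-match lookup across several database directories.
# Usage: resolve_db_file <file name> <dir1> [<dir2> ...]
# Prints the first existing path and returns 0, or returns 1 if none exists.
resolve_db_file() {
  name="$1"
  shift
  for dir in "$@"; do
    if [ -e "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    fi
  done
  return 1
}
```

For example, `resolve_db_file uniprot_all_2021_04.fa /ssd/databases /hdd/databases` prints the HDD path when the file was not copied to the SSD.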
Database Sharding
For high-throughput environments with many CPU cores:
Why Shard?
Sharding enables parallel genetic search across many CPU cores, dramatically reducing wall-clock time.
Benefits:
Utilize 32+ core systems effectively
Reduce genetic search time by 10-50×
Maximize disk I/O parallelization
Sharding Process
Install seqkit
# From conda
conda install -c bioconda seqkit
# Or download binary
wget https://github.com/shenwei356/seqkit/releases/download/v2.5.1/seqkit_linux_amd64.tar.gz
tar -xzf seqkit_linux_amd64.tar.gz
sudo mv seqkit /usr/local/bin/
Shuffle Sequences
seqkit shuffle --two-pass uniref90_2022_05.fa > uniref90_shuffled.fa
Random shuffling ensures balanced shard sizes.
Split into Shards
# Split into 128 shards
seqkit split2 --by-part 128 uniref90_shuffled.fa
Output: uniref90_shuffled.fa.split/uniref90_shuffled.part_001.fa, etc.
Rename with Padding
cd uniref90_shuffled.fa.split
# Shards are numbered from 00000, so part_001 becomes shard 00000
for i in {1..128}; do
  padded=$(printf "%05d" $((i - 1)))
  mv "uniref90_shuffled.part_$(printf "%03d" "$i").fa" \
    "../uniref90.fasta-${padded}-of-00128"
done
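After renaming, it is worth confirming that no shard is missing, since a gap in the numbering would leave part of the database unsearched. The sketch below assumes the uniref90.fasta-XXXXX-of-00128 naming used in this guide; `missing_shards` is a hypothetical helper.

```shell
# Sketch only: missing_shards is a hypothetical helper.
# Usage: missing_shards <prefix> <total shards>
# Prints the name of every expected shard file that does not exist.
missing_shards() {
  prefix="$1"
  total="$2"
  i=0
  while [ "$i" -lt "$total" ]; do
    # Zero-based, five-digit index, followed by the five-digit shard count.
    shard=$(printf '%s-%05d-of-%05d' "$prefix" "$i" "$total")
    [ -e "$shard" ] || printf '%s\n' "$shard"
    i=$((i + 1))
  done
}
```

Run from the directory that holds the shards, `missing_shards uniref90.fasta 128` prints nothing when all 128 files are present.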
Count Sequences/Bases
# For proteins: count sequences (column 4, num_seqs)
seqkit stats -T uniref90.fasta-* | awk '{sum+=$4} END {print sum}'
# For RNA: count bases (column 5, sum_len)
seqkit stats -T ntrna.fasta-* | awk '{sum+=$5} END {print sum}'
Save these values for Z-value flags.
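The protein sequence count can be cross-checked without seqkit by counting FASTA header lines. This is a minimal sketch, assuming standard FASTA files in which every record starts with a `>` line; `count_sequences` is a hypothetical helper, and it counts sequences only, not bases.

```shell
# Sketch only: count_sequences is a hypothetical helper.
# Usage: count_sequences <fasta file> [<fasta file> ...]
# Prints the total number of FASTA records across all given files.
count_sequences() {
  # cat concatenates the files; grep -c counts lines that begin with '>'.
  cat "$@" | grep -c '^>'
}
```

For example, `count_sequences uniref90.fasta-*` should print the same total as the seqkit command for proteins.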
Using Sharded Databases
python run_alphafold.py \
--uniref90_database_path="uniref90.fasta@128" \
--uniref90_z_value=153742194 \
--jackhmmer_n_cpu=2 \
--jackhmmer_max_parallel_shards=16 \
...
--uniref90_database_path: Specifies 128 shards with pattern uniref90.fasta-XXXXX-of-00128
--uniref90_z_value: Total sequence count across all shards (for e-value scaling)
--jackhmmer_max_parallel_shards: Maximum shards to process in parallel
Recommended Shard Counts
Database     Unsharded Size   Recommended Shards
UniRef90     140 GB           128-256
MGnify       120 GB           256-512
BFD Small    65 GB            64-128
UniProt      100 GB           128-256
NT-RNA       30 GB            64-256
RNACentral   8 GB             16-64
RFam         50 MB            8-16
For consistent performance, aim for equal shard sizes (~0.5-2 GB per shard).
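The per-shard size guideline translates into a shard count by ceiling division. The sketch below only does that arithmetic; `shard_count` is a hypothetical helper, sizes are given in MB so that sub-gigabyte targets stay in integer arithmetic, and the recommended ranges in the table remain the reference.

```shell
# Sketch only: shard_count is a hypothetical helper.
# Usage: shard_count <database size in MB> <target shard size in MB>
# Prints the smallest shard count that keeps every shard at or below the target.
shard_count() {
  size_mb="$1"
  target_mb="$2"
  # Integer ceiling division: (a + b - 1) / b
  echo $(( (size_mb + target_mb - 1) / target_mb ))
}
```

For a 140 GB database (143360 MB) and a 1 GB target, `shard_count 143360 1024` gives 140, which falls inside the 128-256 range listed for UniRef90.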
Permissions and Access
Setting Permissions
Improper permissions cause opaque MSA tool errors. Ensure full read/write access.
# Set directory permissions
sudo chmod 755 --recursive /path/to/databases
# If running Docker, ensure container can read
sudo chown -R "$(id -u):$(id -g)" /path/to/databases
Docker Mounts
docker run -it \
--volume /path/to/databases:/root/public_databases:ro \
...
Use :ro (read-only) suffix for safety.
Singularity Binds
singularity exec \
--bind /path/to/databases:/root/public_databases \
...
Verifying Installation
Check Files Exist
ls -lh /path/to/databases/*.fa*
ls -lh /path/to/databases/mmcif_files/ | head
Test with AlphaFold
python run_alphafold.py \
--json_path=test_input.json \
--db_dir=/path/to/databases \
--model_dir=/path/to/models \
--output_dir=/path/to/output
Successful run confirms database setup.
Database Updates
AlphaFold 3 uses specific database versions from the paper. Newer versions may work but are not officially supported.
Using different database versions may affect prediction quality and reproducibility.
If you must update:
Download new version to separate directory
Test with known inputs
Compare results to original databases
Update --db_dir flags
Troubleshooting
Download Interrupted
# Resume wget download
wget -c <url>
# Resume rsync
rsync -avh --progress --partial /source /destination
Corrupted Files
# Check file integrity
md5sum uniref90_2022_05.fa.gz
# Compare with published checksum
# Re-download if needed
rm corrupted_file.gz
wget <url>
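Two checks help decide whether a re-download is needed: comparing the MD5 digest against a known value, and testing the gzip stream itself. This is a minimal sketch, assuming GNU `md5sum` and `gzip` are installed; `verify_md5` and `gzip_ok` are hypothetical helpers, and the expected digest must come from a source you trust.

```shell
# Sketch only: verify_md5 and gzip_ok are hypothetical helpers.
# Usage: verify_md5 <file> <expected md5 hex digest>
# Succeeds when the file's MD5 digest equals the expected value.
verify_md5() {
  file="$1"
  expected="$2"
  # md5sum prints "<digest>  <file name>"; keep only the digest.
  actual=$(md5sum "$file" | awk '{print $1}')
  [ "$actual" = "$expected" ]
}

# Usage: gzip_ok <file.gz>
# Succeeds when gzip can read the whole compressed stream without errors.
gzip_ok() {
  gzip -t "$1" 2>/dev/null
}
```

For example, `gzip_ok uniref90_2022_05.fa.gz || echo "archive is damaged"` catches truncated downloads even when no checksum is available.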
Insufficient Space
# Check available space
df -h /path/to/databases
# Clean up compressed files after extraction
rm *.gz
Permission Errors
# Fix ownership
sudo chown -R "$USER":"$USER" /path/to/databases
# Fix permissions
chmod -R 755 /path/to/databases
# Check paths are absolute
python run_alphafold.py \
--db_dir=/absolute/path/to/databases \
...
# Verify files are readable
head /path/to/databases/uniref90_2022_05.fa
Database Licenses
All databases are available under permissive licenses:
BFD: CC BY 4.0
MGnify: CC0 1.0
PDB: CC0 1.0
UniProt/UniRef: CC BY 4.0
NT-RNA: Modified (see paper)
RFam: CC0 1.0
RNACentral: CC0 1.0
See AlphaFold 3 README for full attribution.
Next Steps
Performance: Optimize database access and search speed
Running Docker: Use databases with AlphaFold 3