MPI parallelization, domain decomposition, and performance scaling in Dedalus
Dedalus uses MPI (Message Passing Interface) for distributed-memory parallelization, enabling large-scale simulations across multiple processors and nodes. The framework automatically handles domain decomposition and data distribution.
The Distributor object directs parallelized distribution and transformation of fields. It manages how D-dimensional data fields are split over an R-dimensional mesh of MPI processes, where R < D.
```python
import dedalus.public as d3
import numpy as np
from mpi4py import MPI

# Default: 1D process mesh using all available processes
coords = d3.CartesianCoordinates('x', 'y', 'z')
dist = d3.Distributor(coords, dtype=np.float64)

# Get MPI information
comm = dist.comm
rank = comm.rank
size = comm.size
```
By default, Dedalus creates a 1D process mesh using all available MPI processes (MPI.COMM_WORLD).
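A Dedalus script is launched under MPI in the usual way; the script name `simulation.py` below is just a placeholder for your own script:

```shell
# Run on 64 processes; Dedalus builds its default 1D process mesh
# from MPI.COMM_WORLD automatically.
mpiexec -n 64 python3 simulation.py
```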
For a 3D problem with shape (Nx, Ny, Nz), the mesh shape should have length 2. When choosing a mesh, follow these guidelines:

- Ensure even divisibility: Ideal load balancing occurs when basis sizes are evenly divisible by mesh sizes. For example, with Nx = 256 and Ny = 256, a mesh of [16, 16] is ideal.
- Prefer isotropic meshes: For a given number of processes, an "isotropic" mesh with similar values in each dimension is theoretically most efficient (e.g., [8, 8] rather than [4, 16] for 64 processes).
- Avoid empty cores: Ensure no process receives zero data. For example, a mesh of [128, 2] for a problem with shape (64, 64, 64) will result in many empty cores, since 128 processes cannot all receive data along a basis of size 64.
```python
import dedalus.public as d3
import numpy as np

# 3D problem: 256 x 256 x 128 grid
Nx, Ny, Nz = 256, 256, 128
coords = d3.CartesianCoordinates('x', 'y', 'z')

# Running on 64 processes; the mesh product must match the process count.
# Good choices:
mesh = [8, 8]    # Isotropic, divides 256 evenly in both dimensions
mesh = [16, 4]   # Also works well
# Poor choices:
mesh = [64, 1]   # Too anisotropic
mesh = [128, 2]  # Wrong size here (needs 256 processes); gives empty cores on a 64^3 grid

# Pass the chosen mesh to the Distributor
dist = d3.Distributor(coords, dtype=np.float64, mesh=[8, 8])
```
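The guidelines above can be turned into a quick sanity check. The sketch below is standalone Python, not part of Dedalus: the helper name `check_mesh` and its scoring rules are illustrative rules of thumb, not Dedalus's actual distribution logic.

```python
from math import prod

def check_mesh(mesh, grid_shape, nprocs):
    """Report basic problems with a candidate 2D process mesh (illustrative)."""
    problems = []
    # The mesh product must match the total number of MPI processes.
    if prod(mesh) != nprocs:
        problems.append(f"mesh product {prod(mesh)} != {nprocs} processes")
    # Compare each mesh dimension against the corresponding basis size.
    for m, n in zip(mesh, grid_shape):
        if m > n:
            problems.append(f"mesh size {m} exceeds basis size {n} (empty cores)")
        elif n % m != 0:
            problems.append(f"basis size {n} not divisible by mesh size {m}")
    # Flag strongly elongated (anisotropic) meshes.
    if max(mesh) / min(mesh) > 4:
        problems.append("mesh is strongly anisotropic")
    return problems

print(check_mesh([8, 8], (256, 256, 128), 64))    # no problems
print(check_mesh([64, 1], (256, 256, 128), 64))   # anisotropic
print(check_mesh([128, 2], (64, 64, 64), 256))    # empty cores, anisotropic
```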
Each process writes its local data to a separate file, and an HDF5 virtual dataset provides a single-file view of the combined data. This mode is fast and recommended for most parallel simulations.
Weak scaling measures performance when increasing both problem size and processors proportionally:
```python
# Example: keep ~2 million (128^3) points per process
# Ideal weak scaling: time per iteration stays constant
#   8 processes:  256 x  256 x  256 grid
#  64 processes:  512 x  512 x  512 grid
# 512 processes: 1024 x 1024 x 1024 grid
```
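A quick arithmetic check, in plain Python and independent of Dedalus, confirms that the per-process load stays fixed across these configurations, which is the defining property of a weak-scaling study:

```python
# (process count, grid shape) pairs from the weak-scaling example above
configs = [
    (8, (256, 256, 256)),
    (64, (512, 512, 512)),
    (512, (1024, 1024, 1024)),
]

for nprocs, (nx, ny, nz) in configs:
    per_proc = nx * ny * nz // nprocs
    print(f"{nprocs:4d} processes: {per_proc} points per process")
# All three configurations give 2097152 (= 128^3) points per process.
```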
The dedalus.cfg file includes parallelism settings:
```ini
[parallelism]
# Default transpose library (fftw, mpi)
TRANSPOSE_LIBRARY = fftw
# Place MPI Barriers before each transpose call
SYNC_TRANSPOSES = False
# Transpose multiple fields together when possible
GROUP_TRANSPOSES = True
```
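The dedalus.cfg file uses standard INI syntax, so its sections can be inspected with Python's `configparser`. The sketch below only illustrates the file format; it is not how Dedalus itself loads its configuration:

```python
import configparser

# Parse an in-memory copy of the [parallelism] section shown above.
cfg_text = """
[parallelism]
TRANSPOSE_LIBRARY = fftw
SYNC_TRANSPOSES = False
GROUP_TRANSPOSES = True
"""

parser = configparser.ConfigParser()
parser.read_string(cfg_text)
par = parser["parallelism"]
print(par["TRANSPOSE_LIBRARY"])            # fftw
print(par.getboolean("SYNC_TRANSPOSES"))   # False
print(par.getboolean("GROUP_TRANSPOSES"))  # True
```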