Dedalus uses MPI (Message Passing Interface) for distributed-memory parallelization, enabling large-scale simulations across multiple processors and nodes. The framework automatically handles domain decomposition and data distribution.

The Distributor

The Distributor object directs parallelized distribution and transformation of fields. It manages how D-dimensional data fields are split over an R-dimensional mesh of MPI processes, where R < D.

Basic Setup

import dedalus.public as d3
import numpy as np
from mpi4py import MPI

# Default: 1D process mesh using all available processes
coords = d3.CartesianCoordinates('x', 'y', 'z')
dist = d3.Distributor(coords, dtype=np.float64)

# Get MPI information
comm = dist.comm
rank = comm.rank
size = comm.size

By default, Dedalus creates a 1D process mesh using all available MPI processes (MPI.COMM_WORLD).

Process Meshes

For multi-dimensional problems, you can specify a multi-dimensional process mesh for better load balancing:
# 2D process mesh for 3D problem
coords = d3.CartesianCoordinates('x', 'y', 'z')
mesh = [4, 4]  # 4x4 = 16 total processes
dist = d3.Distributor(coords, dtype=np.float64, mesh=mesh)

The mesh must have dimension one less than the problem dimension. For a 3D problem, use a 2D mesh. For a 2D problem, use a 1D mesh.
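The entries of any valid mesh must multiply out to the total number of MPI processes. As a quick illustration (this helper is hypothetical, not part of the Dedalus API), you can enumerate the candidate 2D meshes for a given process count:

```python
def candidate_meshes(nprocs):
    """List all [p, q] meshes with p * q == nprocs (hypothetical helper)."""
    return [[p, nprocs // p] for p in range(1, nprocs + 1) if nprocs % p == 0]

# For 16 processes:
print(candidate_meshes(16))  # [[1, 16], [2, 8], [4, 4], [8, 2], [16, 1]]
```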

Domain Decomposition

Dedalus uses multidimensional block distributions when decomposing domains in parallel.

How It Works

In coefficient space:
  • The first R dimensions are distributed over the mesh
  • The last (D-R) dimensions remain local to each process
During transforms to grid space:
  • Dedalus loops backwards over dimensions
  • Performs transforms when dimensions are local
  • Performs MPI transposes to move data between processes
In grid space:
  • The first dimension is local
  • Followed by R distributed dimensions
  • The last (D-R-1) dimensions are local
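The block split along a single distributed dimension can be sketched in plain Python. The exact blocking Dedalus uses internally may differ, so treat this as an illustrative ceil-block decomposition, not the library's implementation:

```python
import math

def block_slice(n_global, n_procs, rank):
    """Slice of a size-n_global axis owned by `rank` under a ceil-block split."""
    block = math.ceil(n_global / n_procs)          # rows per process, rounded up
    start = min(rank * block, n_global)            # clip so late ranks can be empty
    stop = min(start + block, n_global)
    return slice(start, stop)

# 64 coefficients split over 4 processes: 16 rows each
for r in range(4):
    s = block_slice(64, 4, r)
    print(r, s.start, s.stop)
```

Note that when n_procs exceeds n_global, the trailing ranks get empty slices, which is exactly the "empty cores" problem discussed below.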

Distribution Diagram

For a 3D problem with a 2D mesh:
Coefficient Space:    [distributed, distributed, local]

              Transform + Transpose Operations

Grid Space:           [local, distributed, distributed]

Running Parallel Simulations

Serial Execution

python3 simulation.py

Parallel Execution

# Using mpiexec
mpiexec -n 16 python3 simulation.py

# Using mpirun
mpirun -n 16 python3 simulation.py

# On SLURM clusters
srun -n 16 python3 simulation.py

Make sure your Dedalus installation is built with MPI support. To verify, run:

python3 -c "from mpi4py import MPI; print(MPI.COMM_WORLD.size)"

Process Mesh Selection

Guidelines for Optimal Performance

1. Choose mesh dimensions wisely: for a 3D problem with shape (Nx, Ny, Nz), the mesh shape should have length 2.
2. Ensure even divisibility: ideal load balancing occurs when basis sizes are evenly divisible by mesh sizes. For example, with Nx = 256 and Ny = 256, a mesh of [16, 16] is ideal.
3. Prefer isotropic meshes: for a given number of processes, an “isotropic” mesh with similar values in each dimension is theoretically most efficient (e.g., [8, 8] rather than [4, 16] for 64 processes).
4. Avoid empty cores: ensure no process receives zero data. For example, a mesh of [128, 2] for a problem with shape (64, 64, 64) will result in many empty cores.

Example: Choosing a Mesh

# 3D problem: 256 x 256 x 128 grid
Nx, Ny, Nz = 256, 256, 128

# Running on 64 processes
# Good choices:
mesh = [8, 8]     # Isotropic, even division
mesh = [16, 4]    # Also works well

# Poor choices:
mesh = [64, 1]    # Too anisotropic
mesh = [128, 2]   # Wrong total: 128 * 2 = 256 processes, not 64
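The selection logic above can be automated. This pick_mesh helper is hypothetical (not part of the Dedalus API); it prefers meshes that evenly divide the first two grid dimensions and then maximizes isotropy:

```python
def pick_mesh(nprocs, Nx, Ny):
    """Most isotropic [p, q] with p * q == nprocs, preferring even
    division of Nx and Ny (hypothetical helper, not part of Dedalus)."""
    pairs = [(p, nprocs // p) for p in range(1, nprocs + 1) if nprocs % p == 0]
    def score(pair):
        p, q = pair
        divides = (Nx % p == 0) and (Ny % q == 0)
        return (not divides, abs(p - q))  # prefer divisibility, then isotropy
    return list(min(pairs, key=score))

print(pick_mesh(64, 256, 256))   # -> [8, 8]
print(pick_mesh(128, 256, 256))  # -> [8, 16]
```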

Load Balancing

Checking Distribution

You can check how data is distributed across processes:
import dedalus.public as d3
import numpy as np

coords = d3.CartesianCoordinates('x', 'y', 'z')
dist = d3.Distributor(coords, dtype=np.float64, mesh=[4, 4])

bases = (d3.RealFourier(coords['x'], size=256, bounds=(0, 4)),
         d3.RealFourier(coords['y'], size=256, bounds=(0, 4)),
         d3.ChebyshevT(coords['z'], size=128, bounds=(0, 1)))

# Create a field
u = dist.VectorField(coords, bases=bases)

# Check local shape in grid space (accessing u['g'] transforms to grid space)
local_shape = u['g'].shape
print(f"Rank {dist.comm.rank}: local grid shape {local_shape}")

Avoiding Empty Cores

Empty cores waste resources! Consider a problem with global shape N = (64, 64, 64) distributed on P = 256 cores.
  • Bad mesh [128, 2] or [2, 128]: Many empty cores
  • Good mesh [16, 16]: All cores have data
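To see why, count the ranks that actually receive data under a block split. This sketch assumes a simple ceil-block distribution of the first two dimensions, which may differ from Dedalus's exact blocking:

```python
import math

def nonempty_ranks(shape2d, mesh):
    """Count mesh ranks owning at least one row and column under a
    ceil-block split of the first two dimensions (illustrative)."""
    b0 = math.ceil(shape2d[0] / mesh[0])  # block sizes per mesh dimension
    b1 = math.ceil(shape2d[1] / mesh[1])
    full = 0
    for r0 in range(mesh[0]):
        for r1 in range(mesh[1]):
            n0 = max(0, min(shape2d[0] - r0 * b0, b0))  # local extent, clipped
            n1 = max(0, min(shape2d[1] - r1 * b1, b1))
            if n0 > 0 and n1 > 0:
                full += 1
    return full

print(nonempty_ranks((64, 64), [128, 2]))   # 128: half of the 256 ranks are empty
print(nonempty_ranks((64, 64), [16, 16]))   # 256: every rank has data
```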

Parallel I/O

Dedalus provides three parallel I/O modes for output, selected with the parallel keyword of add_file_handler: 'gather', 'virtual', and 'mpio'.

snapshots = solver.evaluator.add_file_handler(
    'snapshots', 
    sim_dt=0.25, 
    max_writes=50,
    parallel='gather'
)

With parallel='gather', all processes send data to rank 0, which writes a single file; this is simple but can be slow for large simulations. With parallel='virtual', each process writes its own file and a lightweight virtual HDF5 file presents them as one dataset. With parallel='mpio', all processes write collectively to a shared file via MPI-IO (requires a parallel build of h5py).

Merging Virtual Files

After simulation, merge virtual files for easier post-processing:
from dedalus.tools import post

post.merge_analysis('snapshots')

Or use the command-line tool:

python3 -m dedalus merge snapshots

Performance Scaling

Strong Scaling

Strong scaling measures speedup when increasing processors for a fixed problem size:
# Example: 512^3 grid on varying processor counts
# Ideal: 2x processors → 2x speedup

# 64 processes:  ~100 iterations/hour
# 128 processes: ~200 iterations/hour (good)
# 256 processes: ~350 iterations/hour (diminishing returns)

Strong scaling eventually plateaus due to communication overhead. Choose the number of processes based on the “knee” in the scaling curve.
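Plugging the example rates above into the standard definition of parallel efficiency (speedup divided by the processor ratio) makes the diminishing returns explicit:

```python
def strong_efficiency(base_procs, base_rate, procs, rate):
    """Parallel efficiency relative to a baseline run:
    (rate / base_rate) / (procs / base_procs)."""
    return (rate / base_rate) / (procs / base_procs)

# Rates from the example above (iterations/hour), 64 processes as baseline
print(strong_efficiency(64, 100, 128, 200))  # 1.0   (ideal scaling)
print(strong_efficiency(64, 100, 256, 350))  # 0.875 (diminishing returns)
```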

Weak Scaling

Weak scaling measures performance when increasing both problem size and processors proportionally:
# Example: Keep ~128^3 (~2.1 million) points per process
# Ideal: Time per iteration stays constant

# 8 processes:   256 x 256 x 256 grid
# 64 processes:  512 x 512 x 512 grid
# 512 processes: 1024 x 1024 x 1024 grid
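A quick arithmetic check confirms that each configuration above carries the same per-process load, 128^3 ≈ 2.1 million grid points:

```python
# (processes, grid points per side) from the weak-scaling example above
configs = [(8, 256), (64, 512), (512, 1024)]

# Points per process for each configuration
per_proc = [n**3 // p for p, n in configs]
print(per_proc)  # [2097152, 2097152, 2097152] -> 128^3 points per process each
```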

Configuration Options

The dedalus.cfg file includes parallelism settings:
[parallelism]
# Default transpose library (fftw, mpi)
TRANSPOSE_LIBRARY = fftw

# Place MPI Barriers before each transpose call
SYNC_TRANSPOSES = False

# Transpose multiple fields together when possible
GROUP_TRANSPOSES = True

See Configuration for more details.

Example: Complete Parallel Setup

import numpy as np
import dedalus.public as d3
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size

# Problem parameters
Nx, Ny, Nz = 512, 512, 256

# Choose mesh based on number of processes
if size == 64:
    mesh = [8, 8]
elif size == 128:
    mesh = [16, 8]
elif size == 256:
    mesh = [16, 16]
else:
    mesh = None  # Use default 1D mesh

# Setup distributor
coords = d3.CartesianCoordinates('x', 'y', 'z')
dist = d3.Distributor(coords, dtype=np.float64, mesh=mesh)

if rank == 0:
    print(f"Running on {size} processes with mesh {dist.mesh}")

# Build bases and fields
bases = (d3.RealFourier(coords['x'], size=Nx, bounds=(0, 4)),
         d3.RealFourier(coords['y'], size=Ny, bounds=(0, 4)),
         d3.ChebyshevT(coords['z'], size=Nz, bounds=(0, 1)))

u = dist.VectorField(coords, bases=bases)
p = dist.Field(bases=bases)

# ... problem and solver construction omitted ...

# File handler with virtual parallel I/O
snapshots = solver.evaluator.add_file_handler(
    'snapshots', 
    sim_dt=0.1, 
    max_writes=100,
    parallel='virtual'
)
snapshots.add_task(u, name='velocity')
snapshots.add_task(p, name='pressure')

Best Practices

Use Appropriate Mesh

Choose mesh dimensions that evenly divide your basis sizes for optimal load balancing.

Profile Your Code

Test different mesh configurations to find the best performance for your problem.

Monitor Scaling

Check strong/weak scaling to determine the optimal processor count.

Use Virtual I/O

For large parallel runs, use parallel='virtual' for file output.
