Dedalus uses MPI (Message Passing Interface) for distributed-memory parallelization, enabling large-scale simulations across multiple processors and nodes. The framework automatically handles domain decomposition and data distribution.

The Distributor

The Distributor object directs parallelized distribution and transformation of fields. It manages how D-dimensional data fields are split over an R-dimensional mesh of MPI processes, where R < D.

Basic Setup

import dedalus.public as d3
import numpy as np
from mpi4py import MPI

# Default: 1D process mesh using all available processes
coords = d3.CartesianCoordinates('x', 'y', 'z')
dist = d3.Distributor(coords, dtype=np.float64)

# Get MPI information
comm = dist.comm
rank = comm.rank
size = comm.size

By default, Dedalus creates a 1D process mesh using all available MPI processes (MPI.COMM_WORLD).

Process Meshes

For multi-dimensional problems, you can specify a multi-dimensional process mesh for better load balancing:
# 2D process mesh for 3D problem
coords = d3.CartesianCoordinates('x', 'y', 'z')
mesh = [4, 4]  # 4x4 = 16 total processes
dist = d3.Distributor(coords, dtype=np.float64, mesh=mesh)

The mesh must have dimension one less than the problem dimension. For a 3D problem, use a 2D mesh. For a 2D problem, use a 1D mesh.
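The entries of any valid mesh must multiply out to the total number of MPI processes. As a quick illustration (this helper is hypothetical, not part of the Dedalus API), you can enumerate the candidate 2D meshes for a given process count:

```python
def candidate_meshes(nprocs):
    """List all [p, q] meshes with p * q == nprocs (hypothetical helper)."""
    return [[p, nprocs // p] for p in range(1, nprocs + 1) if nprocs % p == 0]

# For 16 processes:
print(candidate_meshes(16))  # [[1, 16], [2, 8], [4, 4], [8, 2], [16, 1]]
```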

Domain Decomposition

Dedalus uses multidimensional block distributions when decomposing domains in parallel.

How It Works

In coefficient space:
  • The first R dimensions are distributed over the mesh
  • The last (D-R) dimensions remain local to each process
During transforms to grid space:
  • Dedalus loops backwards over dimensions
  • Performs transforms when dimensions are local
  • Performs MPI transposes to move data between processes
In grid space:
  • The first dimension is local
  • Followed by R distributed dimensions
  • The last (D-R-1) dimensions are local
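The block split along a single distributed dimension can be sketched in plain Python. The exact blocking Dedalus uses internally may differ, so treat this as an illustrative ceil-block decomposition, not the library's implementation:

```python
import math

def block_slice(n_global, n_procs, rank):
    """Slice of a size-n_global axis owned by `rank` under a ceil-block split."""
    block = math.ceil(n_global / n_procs)          # rows per process, rounded up
    start = min(rank * block, n_global)            # clip so late ranks can be empty
    stop = min(start + block, n_global)
    return slice(start, stop)

# 64 coefficients split over 4 processes: 16 rows each
for r in range(4):
    s = block_slice(64, 4, r)
    print(r, s.start, s.stop)
```

Note that when n_procs exceeds n_global, the trailing ranks get empty slices, which is exactly the "empty cores" problem discussed below.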

Distribution Diagram

For a 3D problem with a 2D mesh:
Coefficient Space:    [distributed, distributed, local]

              Transform + Transpose Operations

Grid Space:           [local, distributed, distributed]

Running Parallel Simulations

Serial Execution

python3 simulation.py

Parallel Execution

# Using mpiexec
mpiexec -n 16 python3 simulation.py

# Using mpirun
mpirun -n 16 python3 simulation.py

# On SLURM clusters
srun -n 16 python3 simulation.py

Make sure your Dedalus installation is built with MPI support. To verify, run:

python3 -c "from mpi4py import MPI; print(MPI.COMM_WORLD.size)"

Process Mesh Selection

Guidelines for Optimal Performance

1. Choose mesh dimensions wisely: for a 3D problem with shape (Nx, Ny, Nz), the mesh shape should have length 2.
2. Ensure even divisibility: ideal load balancing occurs when basis sizes are evenly divisible by mesh sizes. For example, with Nx = 256 and Ny = 256, a mesh of [16, 16] is ideal.
3. Prefer isotropic meshes: for a given number of processes, an “isotropic” mesh with similar values in each dimension is theoretically most efficient (e.g., [8, 8] rather than [4, 16] for 64 processes).
4. Avoid empty cores: ensure no process receives zero data. For example, a mesh of [128, 2] for a problem with shape (64, 64, 64) will result in many empty cores.

Example: Choosing a Mesh

# 3D problem: 256 x 256 x 128 grid
Nx, Ny, Nz = 256, 256, 128

# Running on 64 processes
# Good choices:
mesh = [8, 8]     # Isotropic, even division
mesh = [16, 4]    # Also works well

# Poor choices:
mesh = [64, 1]    # Too anisotropic
mesh = [128, 2]   # Wrong total: 128 * 2 = 256 processes, not 64
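The selection logic above can be automated. This pick_mesh helper is hypothetical (not part of the Dedalus API); it prefers meshes that evenly divide the first two grid dimensions and then maximizes isotropy:

```python
def pick_mesh(nprocs, Nx, Ny):
    """Most isotropic [p, q] with p * q == nprocs, preferring even
    division of Nx and Ny (hypothetical helper, not part of Dedalus)."""
    pairs = [(p, nprocs // p) for p in range(1, nprocs + 1) if nprocs % p == 0]
    def score(pair):
        p, q = pair
        divides = (Nx % p == 0) and (Ny % q == 0)
        return (not divides, abs(p - q))  # prefer divisibility, then isotropy
    return list(min(pairs, key=score))

print(pick_mesh(64, 256, 256))   # -> [8, 8]
print(pick_mesh(128, 256, 256))  # -> [8, 16]
```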

Load Balancing

Checking Distribution

You can check how data is distributed across processes:
import dedalus.public as d3
import numpy as np

coords = d3.CartesianCoordinates('x', 'y', 'z')
dist = d3.Distributor(coords, dtype=np.float64, mesh=[4, 4])

bases = (d3.RealFourier(coords['x'], size=256, bounds=(0, 4)),
         d3.RealFourier(coords['y'], size=256, bounds=(0, 4)),
         d3.ChebyshevT(coords['z'], size=128, bounds=(0, 1)))

# Create a field
u = dist.VectorField(coords, bases=bases)

# Check local shape in grid space (accessing u['g'] transforms to grid space)
local_shape = u['g'].shape
print(f"Rank {dist.comm.rank}: local grid shape {local_shape}")

Avoiding Empty Cores

Empty cores waste resources! Consider a problem with global shape N = (64, 64, 64) distributed on P = 256 cores.
  • Bad mesh [128, 2] or [2, 128]: Many empty cores
  • Good mesh [16, 16]: All cores have data
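To see why, count the ranks that actually receive data under a block split. This sketch assumes a simple ceil-block distribution of the first two dimensions, which may differ from Dedalus's exact blocking:

```python
import math

def nonempty_ranks(shape2d, mesh):
    """Count mesh ranks owning at least one row and column under a
    ceil-block split of the first two dimensions (illustrative)."""
    b0 = math.ceil(shape2d[0] / mesh[0])  # block sizes per mesh dimension
    b1 = math.ceil(shape2d[1] / mesh[1])
    full = 0
    for r0 in range(mesh[0]):
        for r1 in range(mesh[1]):
            n0 = max(0, min(shape2d[0] - r0 * b0, b0))  # local extent, clipped
            n1 = max(0, min(shape2d[1] - r1 * b1, b1))
            if n0 > 0 and n1 > 0:
                full += 1
    return full

print(nonempty_ranks((64, 64), [128, 2]))   # 128: half of the 256 ranks are empty
print(nonempty_ranks((64, 64), [16, 16]))   # 256: every rank has data
```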

Parallel I/O

Dedalus provides three parallel I/O modes for output, selected with the parallel keyword of add_file_handler: 'gather', 'virtual', and 'mpio'.

snapshots = solver.evaluator.add_file_handler(
    'snapshots', 
    sim_dt=0.25, 
    max_writes=50,
    parallel='gather'
)

With parallel='gather', all processes send data to rank 0, which writes a single file; this is simple but can be slow for large simulations. With parallel='virtual', each process writes its own file and a lightweight virtual HDF5 file presents them as one dataset. With parallel='mpio', all processes write collectively to a shared file via MPI-IO (requires a parallel build of h5py).

Merging Virtual Files

After simulation, merge virtual files for easier post-processing:
from dedalus.tools import post

post.merge_analysis('snapshots')

Or use the command-line tool:

python3 -m dedalus merge snapshots

Performance Scaling

Strong Scaling

Strong scaling measures speedup when increasing processors for a fixed problem size:
# Example: 512^3 grid on varying processor counts
# Ideal: 2x processors → 2x speedup

# 64 processes:  ~100 iterations/hour
# 128 processes: ~200 iterations/hour (good)
# 256 processes: ~350 iterations/hour (diminishing returns)

Strong scaling eventually plateaus due to communication overhead. Choose the number of processes based on the “knee” in the scaling curve.
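Plugging the example rates above into the standard definition of parallel efficiency (speedup divided by the processor ratio) makes the diminishing returns explicit:

```python
def strong_efficiency(base_procs, base_rate, procs, rate):
    """Parallel efficiency relative to a baseline run:
    (rate / base_rate) / (procs / base_procs)."""
    return (rate / base_rate) / (procs / base_procs)

# Rates from the example above (iterations/hour), 64 processes as baseline
print(strong_efficiency(64, 100, 128, 200))  # 1.0   (ideal scaling)
print(strong_efficiency(64, 100, 256, 350))  # 0.875 (diminishing returns)
```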

Weak Scaling

Weak scaling measures performance when increasing both problem size and processors proportionally:
# Example: Keep ~128^3 (~2.1 million) points per process
# Ideal: Time per iteration stays constant

# 8 processes:   256 x 256 x 256 grid
# 64 processes:  512 x 512 x 512 grid
# 512 processes: 1024 x 1024 x 1024 grid
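A quick arithmetic check confirms that each configuration above carries the same per-process load, 128^3 ≈ 2.1 million grid points:

```python
# (processes, grid points per side) from the weak-scaling example above
configs = [(8, 256), (64, 512), (512, 1024)]

# Points per process for each configuration
per_proc = [n**3 // p for p, n in configs]
print(per_proc)  # [2097152, 2097152, 2097152] -> 128^3 points per process each
```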

Configuration Options

The dedalus.cfg file includes parallelism settings:
[parallelism]
# Default transpose library (fftw, mpi)
TRANSPOSE_LIBRARY = fftw

# Place MPI Barriers before each transpose call
SYNC_TRANSPOSES = False

# Transpose multiple fields together when possible
GROUP_TRANSPOSES = True

See Configuration for more details.

Example: Complete Parallel Setup

import numpy as np
import dedalus.public as d3
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size

# Problem parameters
Nx, Ny, Nz = 512, 512, 256

# Choose mesh based on number of processes
if size == 64:
    mesh = [8, 8]
elif size == 128:
    mesh = [16, 8]
elif size == 256:
    mesh = [16, 16]
else:
    mesh = None  # Use default 1D mesh

# Setup distributor
coords = d3.CartesianCoordinates('x', 'y', 'z')
dist = d3.Distributor(coords, dtype=np.float64, mesh=mesh)

if rank == 0:
    print(f"Running on {size} processes with mesh {dist.mesh}")

# Build bases and fields
bases = (d3.RealFourier(coords['x'], size=Nx, bounds=(0, 4)),
         d3.RealFourier(coords['y'], size=Ny, bounds=(0, 4)),
         d3.ChebyshevT(coords['z'], size=Nz, bounds=(0, 1)))

u = dist.VectorField(coords, bases=bases)
p = dist.Field(bases=bases)

# ... problem and solver construction omitted ...

# File handler with virtual parallel I/O
snapshots = solver.evaluator.add_file_handler(
    'snapshots', 
    sim_dt=0.1, 
    max_writes=100,
    parallel='virtual'
)
snapshots.add_task(u, name='velocity')
snapshots.add_task(p, name='pressure')

Best Practices

Use Appropriate Mesh

Choose mesh dimensions that evenly divide your basis sizes for optimal load balancing.

Profile Your Code

Test different mesh configurations to find the best performance for your problem.

Monitor Scaling

Check strong/weak scaling to determine the optimal processor count.

Use Virtual I/O

For large parallel runs, use parallel='virtual' for file output.
