Optimizing Dedalus simulations involves careful configuration of the numerical stack, efficient problem formulation, and smart timestepping strategies. This guide covers best practices for maximum performance.

Stack Configuration

Disable Multithreading

Dedalus does not fully implement hybrid parallelism, so the best performance is typically seen when there is one MPI process per core. Some underlying components may attempt to use multiple threads, which can substantially degrade performance.
Always disable threading before running Dedalus simulations.
Set these environment variables before launching your simulation:
export OMP_NUM_THREADS=1
export NUMEXPR_MAX_THREADS=1
If you’re using a conda environment built with the Dedalus installation script, these variables are automatically set when you activate the environment.
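You can also guard against forgetting this from within the run script itself. A defensive sketch (not part of Dedalus): setting these variables after NumPy or Dedalus have been imported may be too late, since thread pools can already be initialized, so this must run at the very top of the script.

```python
import os

# Must run before importing numpy/dedalus, or the thread pools
# may already be initialized with more threads.
for var in ("OMP_NUM_THREADS", "NUMEXPR_MAX_THREADS"):
    os.environ.setdefault(var, "1")  # respect an explicit setting, else force 1 thread
```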

Efficient Discretizations

Resolutions for Faster Transforms

The transforms for Fourier and Chebyshev bases use fast Fourier transforms (FFTs). FFT algorithms are most efficient when transform sizes are products of small primes.
Quick rule: Use N = 2^n or N = 2^n × 3^m for fastest transforms.
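This rule is easy to check programmatically. A small stand-alone helper (hypothetical, not part of Dedalus) that tests whether a size factors entirely into small primes:

```python
def is_fast_size(n, primes=(2, 3)):
    """Return True if n factors entirely into the given small primes."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

print(is_fast_size(256))  # True:  256 = 2^8
print(is_fast_size(192))  # True:  192 = 2^6 * 3
print(is_fast_size(100))  # False: 100 = 2^2 * 5^2
```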

Process Meshes for Better Load Balancing

By default, Dedalus uses a 1D process mesh. For 3D problems with many processors, specify a 2D mesh:
import dedalus.public as d3
import numpy as np

# 3D problem with 2D mesh
coords = d3.CartesianCoordinates('x', 'y', 'z')
mesh = [8, 8]  # For 64 processes
dist = d3.Distributor(coords, dtype=np.float64, mesh=mesh)

# Create bases
Nx, Ny, Nz = 512, 512, 256
bases = (d3.RealFourier(coords['x'], size=Nx, bounds=(0, 4)),
         d3.RealFourier(coords['y'], size=Ny, bounds=(0, 4)),
         d3.ChebyshevT(coords['z'], size=Nz, bounds=(0, 1)))
Guidelines for mesh selection:
  1. Mesh shape should have length one less than problem dimension
  2. Choose mesh sizes that are powers of 2 or products of small primes
  3. Prefer “isotropic” meshes: [8, 8] over [64, 1] for 64 processes
  4. Ensure basis sizes are divisible by mesh sizes for ideal load balancing
# Problem: 512 x 512 x 256 grid on 128 processes

# Excellent mesh (isotropic, even division)
mesh = [16, 8]  # 512/16 = 32, 512/8 = 64 ✓

# Good mesh
mesh = [32, 4]  # 512/32 = 16, 512/4 = 128 ✓

# Poor mesh (too anisotropic)
mesh = [128, 1]  # Equivalent to a 1D decomposition; poor load balancing

# Bad mesh (wrong process count, risks empty cores!)
mesh = [64, 32]  # 64*32 = 2048 processes, not 128; oversized meshes leave cores empty

Avoid Empty Cores

Empty cores waste computational resources. This occurs when the mesh is too large relative to the problem size.
Example of empty cores:
  Problem: a (64, 64, 64) grid on 256 processes with mesh [128, 2]
  Result: many processes receive no data and sit idle!
  Better: use mesh [16, 16] so every core has work.
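A rough back-of-envelope check for idle cores can be written as a stand-alone estimate. This is a simplification that ignores Dedalus's actual transposed layouts, so treat it as a heuristic only: each distributed dimension of size n can be split into at most n nonempty blocks.

```python
import math

def busy_cores(grid_shape, mesh):
    """Upper bound on processes that receive data: each distributed
    dimension of size n supports at most n nonempty blocks."""
    busy = 1
    for n, m in zip(grid_shape, mesh):
        busy *= min(n, m)
    return busy, math.prod(mesh)

busy, total = busy_cores((64, 64, 64), [128, 2])
print(f"{busy} of {total} cores have data")  # 128 of 256: half sit idle

busy, total = busy_cores((64, 64, 64), [16, 16])
print(f"{busy} of {total} cores have data")  # 256 of 256: all busy
```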

Problem Formulation

Minimize the Number of Problem Variables

The number of variables has a large impact on simulation performance. Use as few variables as possible within the constraint that PDEs must be first-order in time.
# Fields (assumes coords, dist, xbasis, zbasis, and the unit vector ez are defined)
p = dist.Field(name='p', bases=(xbasis,zbasis))
b = dist.Field(name='b', bases=(xbasis,zbasis))
u = dist.VectorField(coords, name='u', bases=(xbasis,zbasis))
tau_p = dist.Field(name='tau_p')
tau_b1 = dist.Field(name='tau_b1', bases=xbasis)
tau_b2 = dist.Field(name='tau_b2', bases=xbasis)
tau_u1 = dist.VectorField(coords, name='tau_u1', bases=xbasis)
tau_u2 = dist.VectorField(coords, name='tau_u2', bases=xbasis)

# Use substitutions instead of additional variables
lift_basis = zbasis.derivative_basis(1)
lift = lambda A: d3.Lift(A, lift_basis, -1)
grad_u = d3.grad(u) + ez*lift(tau_u1)
grad_b = d3.grad(b) + ez*lift(tau_b1)

Avoid Non-Smooth or Rational-Function NCCs

Non-constant coefficients (NCCs) on the left-hand side should be spectrally smooth for good performance.
# Low-degree polynomials
problem.add_equation("x*dx(u) = f")
problem.add_equation("(1 + x**2)*dx(u) = f")

# Smooth coefficient fields
kappa = dist.Field(name='kappa', bases=xbasis)
kappa['g'] = 1.0 + 0.1*np.sin(2*np.pi*x)  # Smooth: rapidly converging spectral expansion
problem.add_equation("kappa*dx(u) = f")
You can control NCC truncation with solver keywords:
solver = problem.build_solver(timestepper, 
                              ncc_cutoff=1e-10,
                              max_ncc_terms=20)
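The smoothness requirement can be checked numerically before building the solver: a smooth NCC's spectral coefficients decay rapidly, while a non-smooth one needs many terms above any cutoff. A stand-alone NumPy sketch, using a plain Fourier expansion as a stand-in for the basis expansion Dedalus would use:

```python
import numpy as np

N = 64
x = np.linspace(0, 1, N, endpoint=False)

smooth = 1.0 + 0.1*np.sin(2*np.pi*x)  # smooth NCC: coefficients decay fast
kinked = np.abs(x - 0.5)              # non-smooth NCC: kink at x = 0.5

def terms_above_cutoff(f, cutoff=1e-10):
    """Count Fourier coefficients above the cutoff (relative to the largest)."""
    c = np.abs(np.fft.rfft(f)) / N
    return int(np.sum(c > cutoff * c.max()))

print(terms_above_cutoff(smooth))  # only a couple of modes
print(terms_above_cutoff(kinked))  # many more modes retained
```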

Clear Polynomial Denominators

Problems with rational-function NCCs should be multiplied through to clear denominators:
# Disk/cylinder coordinates with r in denominator
# Instead of: (1/r)*∂_φ(u) + ...
# Write as:   ∂_φ(u) + r*(...)

# Bad (high bandwidth):
problem.add_equation("(1/r)*dφ(u) + lap(u) = f")

# Good (polynomial NCCs):
problem.add_equation("dφ(u) + r*lap(u) = r*f")

Timestepping

Avoid Changing the Timestep Unnecessarily

Changing the simulation timestep requires refactorizing the LHS matrices, which is expensive. Keep the timestep constant when possible.
import dedalus.public as d3

# CFL with threshold to prevent small changes
CFL = d3.CFL(solver, initial_dt=max_timestep, cadence=10, 
             safety=0.5, 
             threshold=0.05,  # Don't change dt unless it differs by >5%
             max_change=1.5, 
             min_change=0.5, 
             max_dt=max_timestep)
CFL.add_velocity(u)
The threshold parameter is crucial! Set it to 0.05-0.1 to prevent timestep changes due to small flow variations.
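The intent of threshold can be illustrated with a stand-alone model of the logic (a simplified sketch, not Dedalus's actual CFL implementation): small fluctuations in the CFL target leave dt untouched, so the LHS matrices keep their factorizations.

```python
def next_dt(current_dt, target_dt, threshold=0.05, max_change=1.5, min_change=0.5):
    """Keep dt fixed unless the CFL target differs by more than `threshold`,
    then clip the change factor to [min_change, max_change]."""
    if abs(target_dt - current_dt) <= threshold * current_dt:
        return current_dt  # unchanged -> LHS matrices are not refactorized
    factor = min(max_change, max(min_change, target_dt / current_dt))
    return current_dt * factor

print(next_dt(1e-3, 1.03e-3))  # 0.001: within 5%, dt kept, no refactorization
print(next_dt(1e-3, 5e-3))     # increase clipped to max_change -> 1.5e-3
```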

Choose Appropriate Timesteppers

RK222

Fast and stable - Good default choice for most problems. Second-order accurate.
timestepper = d3.RK222

RK443

Higher accuracy - Four-stage, third-order IMEX Runge-Kutta scheme; more evaluations per step than RK222.
timestepper = d3.RK443

SBDF2

Semi-implicit multistep - Only one implicit solve per step, so it is cheap per iteration; well suited to stiff problems.
timestepper = d3.SBDF2

RKSMR

Low-storage scheme - The Spalart-Moser-Rogers mixed implicit/explicit Runge-Kutta scheme, popular in spectral turbulence simulations.
timestepper = d3.RKSMR

Profiling and Benchmarking

Enable Profiling

Dedalus includes built-in profiling:
# Build solver with profiling enabled
solver = problem.build_solver(d3.RK222)
solver.profile_setup = True

# Run simulation...

# Print profiling statistics
solver.log_stats()
Or enable via configuration:
from dedalus.tools.config import config
config['profiling']['PROFILE_DEFAULT'] = 'True'

Benchmark Different Configurations

  1. Run short test simulations: 100-1000 iterations with each candidate mesh configuration.
  2. Measure iterations per second:
import time

start_time = time.time()
for i in range(1000):
    dt = CFL.compute_timestep()
    solver.step(dt)
end_time = time.time()

if dist.comm.rank == 0:
    print(f"Iterations/sec: {1000/(end_time-start_time):.2f}")
  3. Compare strong scaling: run the same problem size on, e.g., 64, 128, and 256 processes.
  4. Check memory usage:
import resource

# ru_maxrss is reported in KB on Linux
mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Rank {dist.comm.rank}: {mem/1024:.1f} MB")
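Strong-scaling results can be summarized as parallel efficiency relative to the smallest run. A stand-alone sketch using hypothetical measured timings (the iterations/sec values below are made up for illustration):

```python
# Hypothetical iterations/sec measured at each process count
results = {64: 3.2, 128: 6.0, 256: 10.4}

base_procs = min(results)
base_rate = results[base_procs]
for procs, rate in sorted(results.items()):
    speedup = rate / base_rate
    efficiency = speedup / (procs / base_procs)
    # Prints speedup and efficiency relative to the 64-process run
    print(f"{procs:4d} procs: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```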

Quick Performance Checklist

  • OMP_NUM_THREADS=1 is set
  • NUMEXPR_MAX_THREADS=1 is set
  • Using FFTW (not scipy) for transforms
  • MPI library properly configured
  • Basis sizes are powers of 2 or 2^n × 3^m
  • Using 2D mesh for 3D problems with many cores
  • Mesh is isotropic (e.g., [8,8] not [64,1])
  • No empty cores (mesh not too large)
  • Good load balancing (basis sizes divisible by mesh)
  • Minimized number of problem variables
  • Using substitutions instead of extra variables for first-order reductions
  • LHS NCCs are smooth (not discontinuous)
  • Rational functions cleared (no denominators)
  • NCC cutoff parameters tuned if needed
  • CFL threshold > 0 to prevent unnecessary timestep changes
  • Appropriate timestepper chosen (RK222 is good default)
  • Not changing timestep every iteration

Configuration File Settings

Optimize settings in your dedalus.cfg:
[transforms]
DEFAULT_LIBRARY = fftw  # Use FFTW, not scipy
GROUP_TRANSFORMS = False

[transforms-fftw]
PLANNING_RIGOR = measure  # Balance planning time and performance

[parallelism]
TRANSPOSE_LIBRARY = fftw
SYNC_TRANSPOSES = False
GROUP_TRANSPOSES = True

[memory]
STORE_OUTPUTS = True
STORE_LAST_DEFAULT = True

[matrix construction]
STORE_EXPANDED_MATRICES = False  # Set True to trade memory for speed
See Configuration for details on all options.
