Optimizing Dedalus simulations involves careful configuration of the numerical stack, efficient problem formulation, and smart timestepping strategies. This guide covers best practices for maximum performance.

Stack Configuration

Disable Multithreading

Dedalus does not fully implement hybrid parallelism, so the best performance is typically seen when there is one MPI process per core. Some underlying components may attempt to use multiple threads, which can substantially degrade performance.
Always disable threading before running Dedalus simulations.
Set these environment variables before launching your simulation:
export OMP_NUM_THREADS=1
export NUMEXPR_MAX_THREADS=1
If you’re using a conda environment built with the Dedalus installation script, these variables are automatically set when you activate the environment.
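You can also guard against forgetting this from within the run script itself. A defensive sketch (not part of Dedalus): setting these variables after NumPy or Dedalus have been imported may be too late, since thread pools can already be initialized, so this must run at the very top of the script.

```python
import os

# Must run before importing numpy/dedalus, or the thread pools
# may already be initialized with more threads.
for var in ("OMP_NUM_THREADS", "NUMEXPR_MAX_THREADS"):
    os.environ.setdefault(var, "1")  # respect an explicit setting, else force 1 thread
```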

Efficient Discretizations

Resolutions for Faster Transforms

The transforms for Fourier and Chebyshev bases use fast Fourier transforms (FFTs). FFT algorithms are most efficient when transform sizes are products of small primes.
Quick rule: Use N = 2^n or N = 2^n × 3^m for fastest transforms.
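This rule is easy to check programmatically. A small stand-alone helper (hypothetical, not part of Dedalus) that tests whether a size factors entirely into small primes:

```python
def is_fast_size(n, primes=(2, 3)):
    """Return True if n factors entirely into the given small primes."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

print(is_fast_size(256))  # True:  256 = 2^8
print(is_fast_size(192))  # True:  192 = 2^6 * 3
print(is_fast_size(100))  # False: 100 = 2^2 * 5^2
```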

Process Meshes for Better Load Balancing

By default, Dedalus uses a 1D process mesh. For 3D problems with many processors, specify a 2D mesh:
import dedalus.public as d3
import numpy as np

# 3D problem with 2D mesh
coords = d3.CartesianCoordinates('x', 'y', 'z')
mesh = [8, 8]  # For 64 processes
dist = d3.Distributor(coords, dtype=np.float64, mesh=mesh)

# Create bases
Nx, Ny, Nz = 512, 512, 256
bases = (d3.RealFourier(coords['x'], size=Nx, bounds=(0, 4)),
         d3.RealFourier(coords['y'], size=Ny, bounds=(0, 4)),
         d3.ChebyshevT(coords['z'], size=Nz, bounds=(0, 1)))
Guidelines for mesh selection:
  1. Mesh shape should have length one less than problem dimension
  2. Choose mesh sizes that are powers of 2 or products of small primes
  3. Prefer “isotropic” meshes: [8, 8] over [64, 1] for 64 processes
  4. Ensure basis sizes are divisible by mesh sizes for ideal load balancing
# Problem: 512 x 512 x 256 grid on 128 processes

# Excellent mesh (isotropic, even division)
mesh = [16, 8]  # 512/16 = 32, 512/8 = 64 ✓

# Good mesh
mesh = [32, 4]  # 512/32 = 16, 512/4 = 128 ✓

# Poor mesh (too anisotropic)
mesh = [128, 1]  # Equivalent to a 1D decomposition; poor load balancing

# Bad mesh (wrong process count, risks empty cores!)
mesh = [64, 32]  # 64*32 = 2048 processes, not 128; oversized meshes leave cores empty

Avoid Empty Cores

Empty cores waste computational resources. This occurs when the mesh is too large relative to the problem size.
Example of empty cores:
  Problem: a (64, 64, 64) grid on 256 processes with mesh [128, 2]
  Result: many processes receive no data and sit idle!
  Better: use mesh [16, 16] so every core has work.
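A rough back-of-envelope check for idle cores can be written as a stand-alone estimate. This is a simplification that ignores Dedalus's actual transposed layouts, so treat it as a heuristic only: each distributed dimension of size n can be split into at most n nonempty blocks.

```python
import math

def busy_cores(grid_shape, mesh):
    """Upper bound on processes that receive data: each distributed
    dimension of size n supports at most n nonempty blocks."""
    busy = 1
    for n, m in zip(grid_shape, mesh):
        busy *= min(n, m)
    return busy, math.prod(mesh)

busy, total = busy_cores((64, 64, 64), [128, 2])
print(f"{busy} of {total} cores have data")  # 128 of 256: half sit idle

busy, total = busy_cores((64, 64, 64), [16, 16])
print(f"{busy} of {total} cores have data")  # 256 of 256: all busy
```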

Problem Formulation

Minimize the Number of Problem Variables

The number of variables has a large impact on simulation performance. Use as few variables as possible within the constraint that PDEs must be first-order in time.
# Fields (assumes coords, dist, xbasis, zbasis, and the unit vector ez are defined)
p = dist.Field(name='p', bases=(xbasis,zbasis))
b = dist.Field(name='b', bases=(xbasis,zbasis))
u = dist.VectorField(coords, name='u', bases=(xbasis,zbasis))
tau_p = dist.Field(name='tau_p')
tau_b1 = dist.Field(name='tau_b1', bases=xbasis)
tau_b2 = dist.Field(name='tau_b2', bases=xbasis)
tau_u1 = dist.VectorField(coords, name='tau_u1', bases=xbasis)
tau_u2 = dist.VectorField(coords, name='tau_u2', bases=xbasis)

# Use substitutions instead of additional variables
lift_basis = zbasis.derivative_basis(1)
lift = lambda A: d3.Lift(A, lift_basis, -1)
grad_u = d3.grad(u) + ez*lift(tau_u1)
grad_b = d3.grad(b) + ez*lift(tau_b1)

Avoid Non-Smooth or Rational-Function NCCs

Non-constant coefficients (NCCs) on the left-hand side should be spectrally smooth for good performance.
# Low-degree polynomials
problem.add_equation("x*dx(u) = f")
problem.add_equation("(1 + x**2)*dx(u) = f")

# Smooth coefficient fields
kappa = dist.Field(name='kappa', bases=xbasis)
kappa['g'] = 1.0 + 0.1*np.sin(2*np.pi*x)  # Smooth: rapidly converging spectral expansion
problem.add_equation("kappa*dx(u) = f")
You can control NCC truncation with solver keywords:
solver = problem.build_solver(timestepper, 
                              ncc_cutoff=1e-10,
                              max_ncc_terms=20)
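The smoothness requirement can be checked numerically before building the solver: a smooth NCC's spectral coefficients decay rapidly, while a non-smooth one needs many terms above any cutoff. A stand-alone NumPy sketch, using a plain Fourier expansion as a stand-in for the basis expansion Dedalus would use:

```python
import numpy as np

N = 64
x = np.linspace(0, 1, N, endpoint=False)

smooth = 1.0 + 0.1*np.sin(2*np.pi*x)  # smooth NCC: coefficients decay fast
kinked = np.abs(x - 0.5)              # non-smooth NCC: kink at x = 0.5

def terms_above_cutoff(f, cutoff=1e-10):
    """Count Fourier coefficients above the cutoff (relative to the largest)."""
    c = np.abs(np.fft.rfft(f)) / N
    return int(np.sum(c > cutoff * c.max()))

print(terms_above_cutoff(smooth))  # only a couple of modes
print(terms_above_cutoff(kinked))  # many more modes retained
```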

Clear Polynomial Denominators

Problems with rational-function NCCs should be multiplied through to clear denominators:
# Disk/cylinder coordinates with r in denominator
# Instead of: (1/r)*∂_φ(u) + ...
# Write as:   ∂_φ(u) + r*(...)

# Bad (high bandwidth):
problem.add_equation("(1/r)*dφ(u) + lap(u) = f")

# Good (polynomial NCCs):
problem.add_equation("dφ(u) + r*lap(u) = r*f")

Timestepping

Avoid Changing the Timestep Unnecessarily

Changing the simulation timestep requires refactorizing the LHS matrices, which is expensive. Keep the timestep constant when possible.
import dedalus.public as d3

# CFL with threshold to prevent small changes
CFL = d3.CFL(solver, initial_dt=max_timestep, cadence=10, 
             safety=0.5, 
             threshold=0.05,  # Don't change dt unless it differs by >5%
             max_change=1.5, 
             min_change=0.5, 
             max_dt=max_timestep)
CFL.add_velocity(u)
The threshold parameter is crucial! Set it to 0.05-0.1 to prevent timestep changes due to small flow variations.
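The intent of threshold can be illustrated with a stand-alone model of the logic (a simplified sketch, not Dedalus's actual CFL implementation): small fluctuations in the CFL target leave dt untouched, so the LHS matrices keep their factorizations.

```python
def next_dt(current_dt, target_dt, threshold=0.05, max_change=1.5, min_change=0.5):
    """Keep dt fixed unless the CFL target differs by more than `threshold`,
    then clip the change factor to [min_change, max_change]."""
    if abs(target_dt - current_dt) <= threshold * current_dt:
        return current_dt  # unchanged -> LHS matrices are not refactorized
    factor = min(max_change, max(min_change, target_dt / current_dt))
    return current_dt * factor

print(next_dt(1e-3, 1.03e-3))  # 0.001: within 5%, dt kept, no refactorization
print(next_dt(1e-3, 5e-3))     # increase clipped to max_change -> 1.5e-3
```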

Choose Appropriate Timesteppers

RK222

Fast and stable - Good default choice for most problems. Second-order accurate.
timestepper = d3.RK222

RK443

Higher accuracy - Four-stage, third-order IMEX Runge-Kutta scheme; more evaluations per step than RK222.
timestepper = d3.RK443

SBDF2

Semi-implicit multistep - Only one implicit solve per step, so it is cheap per iteration; well suited to stiff problems.
timestepper = d3.SBDF2

RKSMR

Low-storage scheme - The Spalart-Moser-Rogers mixed implicit/explicit Runge-Kutta scheme, popular in spectral turbulence simulations.
timestepper = d3.RKSMR

Profiling and Benchmarking

Enable Profiling

Dedalus includes built-in profiling:
# Build solver with profiling enabled
solver = problem.build_solver(d3.RK222)
solver.profile_setup = True

# Run simulation...

# Print profiling statistics
solver.log_stats()
Or enable via configuration:
from dedalus.tools.config import config
config['profiling']['PROFILE_DEFAULT'] = 'True'

Benchmark Different Configurations

  1. Run short test simulations: 100-1000 iterations with each candidate mesh configuration.
  2. Measure iterations per second:
import time

start_time = time.time()
for i in range(1000):
    dt = CFL.compute_timestep()
    solver.step(dt)
end_time = time.time()

if dist.comm.rank == 0:
    print(f"Iterations/sec: {1000/(end_time-start_time):.2f}")
  3. Compare strong scaling: run the same problem size on, e.g., 64, 128, and 256 processes.
  4. Check memory usage:
import resource

# ru_maxrss is reported in KB on Linux
mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Rank {dist.comm.rank}: {mem/1024:.1f} MB")
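Strong-scaling results can be summarized as parallel efficiency relative to the smallest run. A stand-alone sketch using hypothetical measured timings (the iterations/sec values below are made up for illustration):

```python
# Hypothetical iterations/sec measured at each process count
results = {64: 3.2, 128: 6.0, 256: 10.4}

base_procs = min(results)
base_rate = results[base_procs]
for procs, rate in sorted(results.items()):
    speedup = rate / base_rate
    efficiency = speedup / (procs / base_procs)
    # Prints speedup and efficiency relative to the 64-process run
    print(f"{procs:4d} procs: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```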

Quick Performance Checklist

  • OMP_NUM_THREADS=1 is set
  • NUMEXPR_MAX_THREADS=1 is set
  • Using FFTW (not scipy) for transforms
  • MPI library properly configured
  • Basis sizes are powers of 2 or 2^n × 3^m
  • Using 2D mesh for 3D problems with many cores
  • Mesh is isotropic (e.g., [8,8] not [64,1])
  • No empty cores (mesh not too large)
  • Good load balancing (basis sizes divisible by mesh)
  • Minimized number of problem variables
  • Using substitutions instead of extra variables for first-order reductions
  • LHS NCCs are smooth (not discontinuous)
  • Rational functions cleared (no denominators)
  • NCC cutoff parameters tuned if needed
  • CFL threshold > 0 to prevent unnecessary timestep changes
  • Appropriate timestepper chosen (RK222 is good default)
  • Not changing timestep every iteration

Configuration File Settings

Optimize settings in your dedalus.cfg:
[transforms]
DEFAULT_LIBRARY = fftw  # Use FFTW, not scipy
GROUP_TRANSFORMS = False

[transforms-fftw]
PLANNING_RIGOR = measure  # Balance planning time and performance

[parallelism]
TRANSPOSE_LIBRARY = fftw
SYNC_TRANSPOSES = False
GROUP_TRANSPOSES = True

[memory]
STORE_OUTPUTS = True
STORE_LAST_DEFAULT = True

[matrix construction]
STORE_EXPANDED_MATRICES = False  # Set True to trade memory for speed
See Configuration for details on all options.
