Stack Configuration
Disable Multithreading
Dedalus does not fully implement hybrid parallelism, so the best performance is typically seen when there is one MPI process per core. Some underlying components may attempt to use multiple threads, which can substantially degrade performance. Set the environment variables OMP_NUM_THREADS and NUMEXPR_MAX_THREADS to 1 before launching your simulation.
If you’re using a conda environment built with the Dedalus installation script, these variables are automatically set when you activate the environment.
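If the variables cannot be set in the shell, one option is to set them at the very top of the driver script. The following is a minimal sketch, not part of Dedalus: the helper name `limit_threads` is hypothetical, and it assumes the assignment runs before NumPy, SciPy, or Dedalus are imported, since threaded libraries typically read these variables once at load time.

```python
import os

# Thread-limiting variables named in this guide.
THREAD_VARS = ("OMP_NUM_THREADS", "NUMEXPR_MAX_THREADS")


def limit_threads(variables=THREAD_VARS):
    """Force each listed variable to "1" and return the values now in effect.

    Hypothetical helper: call it before importing any numerical library.
    """
    for name in variables:
        os.environ[name] = "1"
    return {name: os.environ[name] for name in variables}


settings = limit_threads()
print(settings)
```

Setting the variables in the shell or job script remains the more robust choice, because it does not depend on import order.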
Efficient Discretizations
Resolutions for Faster Transforms
The transforms for Fourier and Chebyshev bases use fast Fourier transforms (FFTs). FFT algorithms are most efficient when transform sizes are products of small primes.
- Recommended: powers of 2, or sizes of the form 2^n × 3^m
- Avoid: sizes with large prime factors
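A quick way to vet a candidate resolution is to check whether it factors entirely into small primes. The sketch below is a standalone illustration that does not call Dedalus. The function names are hypothetical, and the default prime set (2, 3) is an assumption chosen to match the 2^n × 3^m form. Pass a larger set such as (2, 3, 5) if your FFT library handles those factors well.

```python
def is_fft_friendly(n, primes=(2, 3)):
    """Return True if n factors completely into the given small primes."""
    if n < 1:
        return False
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1


def next_fft_friendly(n, primes=(2, 3)):
    """Return the smallest size >= n that factors into the given primes."""
    candidate = max(n, 1)
    while not is_fft_friendly(candidate, primes):
        candidate += 1
    return candidate


for size in (256, 257, 384):
    print(size, is_fft_friendly(size), next_fft_friendly(size))
```

For example, 257 is prime, so the helper suggests rounding up to 288 (2^5 × 3^2) instead.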
Process Meshes for Better Load Balancing
By default, Dedalus uses a 1D process mesh. For 3D problems with many processors, specify a 2D mesh:
- Mesh shape should have length one less than the problem dimension
- Choose mesh sizes that are powers of 2 or products of small primes
- Prefer “isotropic” meshes: [8, 8] over [64, 1] for 64 processes
- Ensure basis sizes are divisible by mesh sizes for ideal load balancing
Example: Mesh Selection Strategy
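One possible strategy, sketched here as an illustration and not taken from Dedalus, is to pick the most nearly square factorization of the process count and then check how evenly each distributed axis divides. The function names are hypothetical. The empty-core estimate assumes a simple block distribution in which each process holds ceil(size / procs) entries along an axis.

```python
import math


def choose_mesh(nprocs):
    """Split nprocs into the most nearly square 2D mesh (m0 <= m1)."""
    m0 = math.isqrt(nprocs)
    while nprocs % m0 != 0:
        m0 -= 1
    return (m0, nprocs // m0)


def empty_cores(size, procs):
    """Estimate idle processes along one axis.

    Assumption: a block distribution where each process holds
    ceil(size / procs) entries, so trailing processes may get nothing.
    """
    block = -(-size // procs)   # ceiling division
    used = -(-size // block)    # processes that receive at least one entry
    return procs - used


def is_balanced(size, procs):
    """Return True if the axis divides evenly across the processes."""
    return size % procs == 0


mesh = choose_mesh(64)
print("mesh for 64 processes:", mesh)
print("256 points over", mesh[0], "processes balanced:", is_balanced(256, mesh[0]))
print("empty cores for 64 points over 128 processes:", empty_cores(64, 128))
```

For 64 processes this yields the isotropic (8, 8) mesh. A prime process count degenerates to a 1D mesh, which is a hint to request a different number of cores.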
Avoid Empty Cores
Empty cores waste computational resources. This occurs when the mesh is too large relative to the problem size.
Problem Formulation
Minimize the Number of Problem Variables
The number of variables has a large impact on simulation performance. Use as few variables as possible within the constraint that PDEs must be first-order in time.
Avoid Non-Smooth or Rational-Function NCCs
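Non-constant coefficients (NCCs) on the left-hand side should be spectrally smooth for good performance.
- Good NCCs: smooth functions, such as low-degree polynomials
- Problematic NCCs: non-smooth (e.g. discontinuous) functions and rational functions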
You can control NCC truncation with solver keywords.
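To see why spectral smoothness matters for truncation, it can help to count how many Chebyshev coefficients of a function exceed a given cutoff. The sketch below is a standalone illustration that does not call Dedalus. The function names, the cutoff of 1e-6, and the two sample functions are assumptions chosen for the demonstration.

```python
import math


def chebyshev_coefficients(f, n=256):
    """Approximate the first n Chebyshev coefficients of f on [-1, 1]
    using Gauss-Chebyshev quadrature with n nodes."""
    thetas = [math.pi * (j + 0.5) / n for j in range(n)]
    values = [f(math.cos(t)) for t in thetas]
    coeffs = []
    for k in range(n):
        s = sum(v * math.cos(k * t) for v, t in zip(values, thetas))
        coeffs.append((1.0 if k == 0 else 2.0) * s / n)
    return coeffs


def terms_above_cutoff(f, cutoff=1e-6, n=256):
    """Count coefficients whose magnitude exceeds the cutoff."""
    return sum(1 for c in chebyshev_coefficients(f, n) if abs(c) > cutoff)


smooth = terms_above_cutoff(lambda x: 1.0 + x**2)
rational = terms_above_cutoff(lambda x: 1.0 / (1.0 + 25.0 * x**2))
print("polynomial NCC terms above cutoff:", smooth)
print("rational NCC terms above cutoff:", rational)
```

The polynomial needs only two terms, while the rational function needs many more at the same cutoff. Each retained term adds work to the left-hand-side matrices, which is why smooth NCCs are cheaper.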
Clear Polynomial Denominators
Problems with rational-function NCCs should be multiplied through to clear denominators.
Timestepping
Avoid Changing the Timestep Unnecessarily
Changing the simulation timestep requires refactorizing the LHS matrices, which is expensive. Keep the timestep constant when possible.
Choose Appropriate Timesteppers
RK222
Fast and stable - Good default choice for most problems. Second-order accurate.
RK443
Higher accuracy - Third-order accurate with four stages, so it requires more evaluations per step.
SBDF2
Implicit methods - Better for stiff problems. Fewer steps but more expensive.
RKSMR
Split methods - For problems with multiple timescales (e.g., rotation + diffusion).
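Returning to the earlier advice on avoiding unnecessary timestep changes: a fractional threshold is one way to hold the timestep fixed while the proposed value fluctuates only slightly. The sketch below is a standalone illustration of that idea, not Dedalus's own CFL implementation. The function names and the sample timestep sequence are hypothetical.

```python
def update_timestep(current_dt, proposed_dt, threshold=0.1):
    """Keep current_dt unless the proposed value differs from it by more
    than the fractional threshold."""
    if current_dt is None:
        return proposed_dt
    relative_change = abs(proposed_dt - current_dt) / current_dt
    return proposed_dt if relative_change > threshold else current_dt


def count_timestep_changes(proposed_dts, threshold):
    """Count how often the accepted timestep changes, which stands in for
    the number of matrix refactorizations."""
    dt = None
    changes = 0
    for proposed in proposed_dts:
        new_dt = update_timestep(dt, proposed, threshold)
        if new_dt != dt:
            changes += 1
        dt = new_dt
    return changes


proposed = [0.0100, 0.0101, 0.0099, 0.0102, 0.0120, 0.0121, 0.0119]
print("changes with threshold 0.0:", count_timestep_changes(proposed, 0.0))
print("changes with threshold 0.1:", count_timestep_changes(proposed, 0.1))
```

With no threshold, every small fluctuation triggers a change. With a 10% threshold, only the initial value and the one large jump do.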
Profiling and Benchmarking
Enable Profiling
Dedalus includes built-in profiling.
Benchmark Different Configurations
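import time

# Time a fixed number of steps; CFL, solver, and dist come from the simulation script.
start_time = time.time()
for i in range(1000):
    dt = CFL.compute_timestep()
    solver.step(dt)
end_time = time.time()

# Report throughput from the root process only.
if dist.comm.rank == 0:
    print(f"Iterations/sec: {1000/(end_time-start_time):.2f}")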
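The same measurement can be wrapped in a reusable helper so that different configurations are timed the same way. This is a minimal sketch, not part of Dedalus: `benchmark` and `dummy_step` are hypothetical names, and `dummy_step` stands in for a real call such as stepping the solver. The warmup loop is included on the assumption that the first few iterations may carry one-time setup costs.

```python
import time


def benchmark(step, iterations=100, warmup=10):
    """Return iterations per second for a callable, excluding warmup."""
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(iterations):
        step()
    elapsed = time.perf_counter() - start
    return iterations / elapsed


def dummy_step():
    """Placeholder workload standing in for one simulation step."""
    sum(i * i for i in range(1000))


rate = benchmark(dummy_step)
print(f"Iterations/sec: {rate:.2f}")
```

Run each configuration with identical `iterations` and `warmup` values so the reported rates are comparable.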
Quick Performance Checklist
Environment Configuration
- OMP_NUM_THREADS=1 is set
- NUMEXPR_MAX_THREADS=1 is set
- Using FFTW (not scipy) for transforms
- MPI library properly configured
Resolution and Mesh
- Basis sizes are powers of 2 or 2^n × 3^m
- Using 2D mesh for 3D problems with many cores
- Mesh is isotropic (e.g., [8,8] not [64,1])
- No empty cores (mesh not too large)
- Good load balancing (basis sizes divisible by mesh)
Problem Formulation
- Minimized number of problem variables
- Using substitutions instead of first-order reductions
- LHS NCCs are smooth (not discontinuous)
- Rational functions cleared (no denominators)
- NCC cutoff parameters tuned if needed
Timestepping
- CFL threshold > 0 to prevent unnecessary timestep changes
- Appropriate timestepper chosen (RK222 is good default)
- Not changing timestep every iteration
Configuration File Settings
Optimize settings in your dedalus.cfg.
See Also
- Parallel Computing - MPI parallelization and domain decomposition
- Configuration - Detailed configuration options
- Tau Method - Efficient problem formulation