Because OFT preserves hyperspherical energy (the pairwise angular relationships between weight vectors), it tends to retain the base model’s style and composition while adapting its content. This makes OFT well-suited for style preservation tasks where LoRA might introduce unwanted drift.
Available modules
| Module | File | Target architecture |
|---|---|---|
| networks.oft | networks/oft.py | SD 1.x / 2.x (and SDXL) |
| networks.oft_flux | networks/oft_flux.py | FLUX.1 |
How OFT differs from LoRA
| Property | LoRA | OFT |
|---|---|---|
| Update type | Low-rank additive delta | Block-diagonal orthogonal rotation |
| Preserves angles | No | Yes |
| Parameter structure | Two low-rank matrices per layer | One block-diagonal matrix per layer |
| Constraint | None (unconstrained delta) | Norm constraint on skew-symmetric matrices |
| Typical network_alpha | 1–32 | Small values like 1e-3 |
| Good for | General fine-tuning | Style preservation, conservative adaptation |
OFT for SD 1.x/2.x (networks.oft)
networks.oft targets attention layers in the UNet by default (CrossAttention). You can expand coverage to all linear layers in Transformer2DModel blocks or to Conv2d layers.
The network_dim argument sets the number of orthogonal blocks (not a rank in the LoRA sense). A larger network_dim gives finer-grained rotation matrices but increases parameter count.
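To make the "number of blocks, not a rank" point concrete, here is a small illustrative sketch (the function name is hypothetical, not part of the sd-scripts API) of how network_dim partitions a layer's output dimension:

```python
# Sketch: how network_dim partitions a layer's output dimension into
# orthogonal blocks (illustrative only, not the sd-scripts implementation).

def oft_block_shape(out_dim: int, network_dim: int) -> tuple[int, int]:
    """Return (num_blocks, block_size) for an OFT layer.

    network_dim = number of block-diagonal rotation blocks; each block
    rotates a contiguous block_size-sized slice of the output features.
    """
    assert out_dim % network_dim == 0, "out_dim must be divisible by network_dim"
    block_size = out_dim // network_dim
    return network_dim, block_size

# A 320-channel projection with network_dim=4 gets four 80x80 rotation
# blocks; raising network_dim to 16 gives sixteen 20x20 blocks.
print(oft_block_shape(320, 4))   # (4, 80)
print(oft_block_shape(320, 16))  # (16, 20)
```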
network_args for networks.oft
- Expand OFT coverage from attention-only (CrossAttention) to all linear layers inside Transformer2DModel blocks, including feed-forward layers. This increases the number of trained parameters.
- Also apply OFT to Conv2d layers in ResnetBlock2D, Downsample2D, and Upsample2D modules. Useful when fine-tuning for textures or styles that are encoded in the ResNet layers.

Training example
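A minimal command sketch for training with networks.oft. Paths, the dataset config, and all hyperparameter values are placeholders, not tested or recommended settings:

```shell
accelerate launch train_network.py \
  --pretrained_model_name_or_path=<base model> \
  --dataset_config=<dataset.toml> \
  --output_dir=<output dir> \
  --network_module=networks.oft \
  --network_dim=4 \
  --network_alpha=1e-3 \
  --learning_rate=1e-4 \
  --max_train_epochs=10
```

Note that network_alpha here acts as the norm constraint, so it takes small values like 1e-3 rather than the 1–32 range typical of LoRA.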
OFT for FLUX.1 (networks.oft_flux)
networks.oft_flux targets FLUX.1’s DoubleStreamBlock and SingleStreamBlock modules. Because FLUX combines Q, K, and V into a single projection (qkv), oft_flux handles split dimensions automatically — each sub-projection (Q, K, V) gets its own block-diagonal rotation matrix.
The constraint parameter (network_alpha) scales proportionally to the output dimension of each sub-projection rather than the full combined QKV output, which is the primary behavioral difference from networks.oft.
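To make the scaling difference concrete, here is a hypothetical arithmetic sketch (dimension values are assumptions for illustration, and this is not the actual oft_flux code) of a constraint proportional to each sub-projection versus one proportional to the fused QKV output:

```python
# Sketch: constraint scaling for a fused qkv projection (illustrative).
# FLUX fuses Q, K, and V into one linear layer with output dim 3 * hidden.

hidden = 3072                 # per-sub-projection output width (assumed)
qkv_out = 3 * hidden          # fused qkv output width
alpha = 1e-3                  # network_alpha

# networks.oft-style: constraint proportional to the full output dim.
constraint_full = alpha * qkv_out

# networks.oft_flux-style: constraint proportional to each sub-projection,
# applied separately to the Q, K, and V rotation matrices.
constraint_per_sub = alpha * hidden

print(constraint_full, constraint_per_sub)
```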
network_args for networks.oft_flux
- Expand coverage from attention-only (SelfAttention) to all linear layers inside DoubleStreamBlock and SingleStreamBlock, including MLP layers.

Training example
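A minimal command sketch for training with networks.oft_flux. Paths and hyperparameter values are placeholders, not tested settings:

```shell
accelerate launch flux_train_network.py \
  --pretrained_model_name_or_path=<flux model> \
  --clip_l=<clip_l model> --t5xxl=<t5xxl model> --ae=<ae model> \
  --dataset_config=<dataset.toml> \
  --output_dir=<output dir> \
  --network_module=networks.oft_flux \
  --network_dim=4 \
  --network_alpha=1e-3
```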
OFT internals
OFT learns a block-diagonal skew-symmetric matrix Q per layer. The orthogonal rotation matrix R is computed via the Cayley map:

R = (I + Q)(I − Q)⁻¹

where I is the identity matrix. Because Q is skew-symmetric (Qᵀ = −Q), R is guaranteed to be orthogonal.
The constraint parameter (network_alpha) limits the Frobenius norm of Q to prevent the rotation from deviating too far from the identity. At network_alpha = 0, no constraint is applied. At inference, the rotated weight is:

W′ = R W₀

where W₀ is the frozen pretrained weight matrix.
The network_dim argument controls how many blocks the output dimension is divided into. Each block has its own block_size × block_size rotation matrix, where block_size = out_dim / network_dim. A larger network_dim therefore means more, smaller blocks, each capturing a finer-grained rotation.
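The whole construction can be sketched end to end with NumPy. This is a toy standalone demo of the math described above (Cayley map, block-diagonal rotation, angle preservation), not the sd-scripts implementation; all dimensions are arbitrary:

```python
import numpy as np

def cayley(Q: np.ndarray) -> np.ndarray:
    """Cayley map: skew-symmetric Q -> orthogonal R = (I + Q)(I - Q)^-1."""
    I = np.eye(Q.shape[0])
    return (I + Q) @ np.linalg.inv(I - Q)

rng = np.random.default_rng(0)

out_dim, num_blocks = 8, 2           # network_dim=2 -> two 4x4 blocks
block_size = out_dim // num_blocks

# Build one small skew-symmetric Q per block and map it through Cayley.
blocks = []
for _ in range(num_blocks):
    A = rng.normal(size=(block_size, block_size))
    Q = A - A.T                      # skew-symmetric: Q^T = -Q
    blocks.append(cayley(Q))

# Assemble the block-diagonal orthogonal rotation R.
R = np.zeros((out_dim, out_dim))
for i, B in enumerate(blocks):
    s = i * block_size
    R[s:s + block_size, s:s + block_size] = B

W0 = rng.normal(size=(out_dim, 5))   # toy "pretrained" weight
W = R @ W0                           # rotated weight used at inference

# R is orthogonal, so column norms (and pairwise angles) are preserved.
print(np.allclose(R @ R.T, np.eye(out_dim)))                               # True
print(np.allclose(np.linalg.norm(W, axis=0), np.linalg.norm(W0, axis=0)))  # True
```

The last two checks are the "preserves hyperspherical energy" property from the introduction: rotating the frozen weight leaves lengths and angles between weight vectors unchanged.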