
Overview

Neural Autoregressive Flow (NAF) uses monotonic neural networks (MNNs) as universal approximators for the per-feature transformations in an autoregressive flow. Unlike MAF, which is restricted to simple affine transformations, NAF can represent arbitrary monotonic functions.
Invertibility is only guaranteed for features within the interval [-10, 10]. It is recommended to standardize features (zero mean, unit variance) before training.
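As a concrete illustration of the standardization step, here is a minimal sketch using plain PyTorch (the raw scale and offset below are arbitrary toy values):

```python
import torch

# Toy data whose raw scale could fall outside [-10, 10]
x = torch.randn(1000, 5) * 3.0 + 7.0

# Standardize per feature: zero mean, unit variance
mean, std = x.mean(dim=0), x.std(dim=0)
x_std = (x - mean) / std
```

Keep `mean` and `std` so that samples drawn from the trained flow can be mapped back to the original scale.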

Reference

Neural Autoregressive Flows (Huang et al., 2018)
https://arxiv.org/abs/1804.00779

Class Definition

zuko.flows.NAF(
    features: int,
    context: int = 0,
    transforms: int = 3,
    randperm: bool = False,
    signal: int = 16,
    network: dict = {},
    **kwargs
)

Parameters

features
int
required
The number of features in the data.
context
int
default:"0"
The number of context features for conditional density estimation.
transforms
int
default:"3"
The number of autoregressive transformations to stack.
randperm
bool
default:"False"
Whether features are randomly permuted between transformations. If False, features alternate between ascending and descending order.
signal
int
default:"16"
The number of signal features for the monotonic neural network. Higher values increase expressivity but add computational cost.
network
dict
default:"{}"
Keyword arguments passed to the MNN (monotonic neural network) constructor:
  • hidden_features: Hidden layer sizes for the monotonic network
  • activation: Activation function
**kwargs
dict
Additional keyword arguments passed to MaskedAutoregressiveTransform:
  • hidden_features: Hidden layer sizes for the autoregressive network
  • activation: Activation function

Usage Example

import torch
import zuko

# Create an unconditional NAF
flow = zuko.flows.NAF(
    features=5,
    transforms=5,
    signal=32,
    hidden_features=[128, 128]
)

# Sample from the flow
dist = flow()
samples = dist.sample((1000,))
print(samples.shape)  # torch.Size([1000, 5])

# Compute log probabilities
log_prob = dist.log_prob(samples)
print(log_prob.shape)  # torch.Size([1000])

Conditional Flow

# Create a conditional NAF
flow = zuko.flows.NAF(
    features=3,
    context=5,
    transforms=5,
    signal=24
)

context = torch.randn(5)
dist = flow(context)
samples = dist.sample((100,))

Training Example

import torch.optim as optim

# Create flow with custom network configurations
flow = zuko.flows.NAF(
    features=10,
    transforms=5,
    signal=32,
    hidden_features=[256, 256],  # Autoregressive network
    network={'hidden_features': [64, 64, 64]}  # Monotonic network
)

optimizer = optim.Adam(flow.parameters(), lr=1e-3)

for epoch in range(100):
    for x in dataloader:  # dataloader yields batches of shape (batch_size, 10)
        optimizer.zero_grad()
        
        # Ensure data is in [-10, 10]
        x = torch.clamp(x, -10, 10)
        
        loss = -flow().log_prob(x).mean()
        loss.backward()
        optimizer.step()
    
    print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

Methods

forward(c=None)

Returns a normalizing flow distribution. Arguments:
  • c (Tensor, optional): Context tensor of shape (*, context)
Returns:
  • NormalizingFlow: A distribution with:
    • sample(shape): Sample from the distribution
    • log_prob(x): Compute log probability of samples
    • rsample(shape): Reparameterized sampling

When to Use NAF

Good for:
  • Complex, highly nonlinear distributions
  • High-dimensional data
  • When you need universal approximation
  • Maximum expressivity in autoregressive flows
Consider alternatives if:
  • You need fast sampling (use RealNVP)
  • Your data is outside [-10, 10] and can’t be standardized
  • You have limited compute (use MAF or NSF)
  • You need smooth, well-behaved transformations (use NSF)

Tips

  1. Standardize your data: NAF requires features in [-10, 10]. Always normalize to zero mean and unit variance.
  2. Tune signal dimension: Start with signal=16. Increase to 32 or 64 for more complex data.
  3. Use softclip: NAF automatically includes SoftclipTransform layers between transformations to keep values bounded.
  4. Balance network sizes: The autoregressive network (hidden_features) predicts signals, while the monotonic network (network['hidden_features']) performs transformations.

Architecture Details

NAF consists of:
  • Base distribution: Diagonal Gaussian N(0, I)
  • Transformation: Monotonic neural networks with autoregressive structure
  • Signal network: Masked MLP predicts signal vectors autoregressively
  • Monotonic network: MLP with positive weights computes transformations
  • Softclip layers: Inserted between transformations to maintain bounds
Each transformation:
y_i = MNN(x_i; signal_i(x_1, ..., x_{i-1}, c))
where MNN is a monotonic neural network and signal_i is predicted autoregressively.

Monotonic Neural Networks

The key innovation in NAF is the use of monotonic neural networks:
  • Positive weights: All weights in the network are positive, ensuring monotonicity
  • Flexible: Can approximate any continuous monotonic function
  • Signal-based: Behavior is modulated by signal vectors rather than changing weights
# Simplified sketch of a monotonic network
# (MonotonicMLP stands in for an MLP constrained to positive weights)
class MNN(nn.Module):
    def __init__(self, signal_dim):
        super().__init__()
        # All layers have positive weights, ensuring monotonicity in x
        self.layers = MonotonicMLP(1 + signal_dim, 1)

    def forward(self, x, signal):
        # Concatenate the scalar input with its conditioning signal
        inp = torch.cat([x, signal], dim=-1)
        return self.layers(inp)
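The positive-weight idea can be demonstrated with a self-contained toy (this `MonotonicLinear` is illustrative, not zuko's implementation): exponentiating the raw weight matrix makes every effective weight positive, and composing such layers with a monotone activation yields a monotone function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicLinear(nn.Linear):
    # Exponentiate the raw weights so the effective weights are positive
    def forward(self, x):
        return F.linear(x, self.weight.exp(), self.bias)

net = nn.Sequential(
    MonotonicLinear(1, 16),
    nn.Sigmoid(),            # monotone activation preserves monotonicity
    MonotonicLinear(16, 1),
)

x = torch.linspace(-3.0, 3.0, 100).unsqueeze(-1)
y = net(x).squeeze(-1)

# Positive weights + monotone activations => output is non-decreasing in x
monotone = bool((y[1:] >= y[:-1]).all())
```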

NAF vs Other Flows

Property         NAF             MAF        NSF
---------------  --------------  ---------  ---------
Transformation   Neural network  Affine     Spline
Expressivity     Very high       Medium     High
Training speed   Slow            Fast       Medium
Sampling speed   Slow            Slow       Slow
Memory usage     High            Low        Medium
Domain           [-10, 10]       Unbounded  [-5, 5]

Advanced Usage

Custom Monotonic Network

flow = zuko.flows.NAF(
    features=10,
    transforms=5,
    signal=24,
    network={
        'hidden_features': [128, 128, 128],
        'activation': nn.ELU  # Different activation
    }
)

High-Dimensional Data

# For high dimensions, use coupling
flow = zuko.flows.NAF(
    features=100,
    transforms=3,
    passes=2,  # Coupling instead of fully autoregressive
    signal=32
)

Fine-Grained Control

from zuko.flows.autoregressive import MaskedAutoregressiveTransform
from zuko.flows.neural import MNN

# Manually construct NAF-like flow
transform = MaskedAutoregressiveTransform(
    features=10,
    context=5,
    univariate=MNN(signal=32, hidden_features=[64, 64, 64]),
    shapes=[(32,)],  # Signal shape
    hidden_features=[256, 256]
)

Computational Considerations

NAF is computationally expensive:
  • Parameters: More than MAF due to monotonic networks
  • Forward pass: Slower due to neural network evaluations
  • Memory: Higher due to signal vectors and network activations
Optimization strategies:
  1. Use smaller signal dimensions (8-16)
  2. Use coupling (passes=2) for high dimensions
  3. Reduce monotonic network depth
  4. Use mixed precision training

See Also

  • UNAF - Unconstrained variant with integration
  • MAF - Simpler affine alternative
  • NSF - Spline-based alternative
  • MonotonicTransform - The underlying transformation
