Tango

Tango is Firedancer’s zero-copy, lock-free IPC framework designed for ultra-low-latency message passing between processing units. It provides the foundational messaging primitives used throughout Firedancer for communication between tiles, processes, and system components.

Overview

Tango's messaging system is built on fragments (variable-size message chunks) whose metadata drives routing and filtering. The design provides:

Zero-copy communication: Direct memory access without data copying
Lock-free operations: No mutex contention in critical paths
Ordered delivery: Strict sequential ordering with gap detection
Overrun detection: Automatic detection of lost messages
Multi-producer/multi-consumer: Flexible topology support

Tango’s design allows seamless transition between single-process threaded and multi-process distributed deployments without code changes.

Core Concepts

Message Fragments

Messages in Tango are partitioned into fragments:

Each fragment carries 0-65,535 bytes of payload
Fragments have 64-bit globally unique sequence numbers
Multi-fragment messages are supported (unlimited fragments per message)
Zero-sized fragments are valid (heartbeat/keepalive)

Fragment metadata:

struct fd_frag_meta {
  ulong  seq;     // Sequence number (globally unique)
  ulong  sig;     // Application-defined signature (filtering)
  uint   chunk;   // Compressed payload location
  ushort sz;      // Fragment size (0-65535 bytes)
  ushort ctl;     // Control bits (SOM/EOM/ERR/origin)
  uint   tsorig;  // Origin timestamp (compressed)
  uint   tspub;   // Publish timestamp (compressed)
};

Control Bits

Fragment boundaries:

SOM (Start-of-Message): First fragment of a message
EOM (End-of-Message): Last fragment of a message
ERR (Error): Entire message is corrupt (discard)

Origin tracking:

13-bit origin ID (0-8191)
Identifies message producer
Up to FD_FRAG_META_ORIG_MAX (8192) origins
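
The control word can be pictured as a 16-bit field holding the three flag bits plus the 13-bit origin. The sketch below assumes the flags sit in the low three bits and the origin in the upper thirteen. The `demo_` names and the exact bit positions are illustrative assumptions, not the definitions in Tango's headers.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_ORIG_MAX 8192u  /* 13-bit origin: valid IDs are 0..8191 */

/* Pack origin and flags into a 16-bit control word.
   Assumed layout: bit 0 = SOM, bit 1 = EOM, bit 2 = ERR, bits 3..15 = origin. */
static inline uint16_t demo_ctl(unsigned orig, int som, int eom, int err) {
  assert(orig < DEMO_ORIG_MAX);
  return (uint16_t)((orig << 3) | ((unsigned)!!err << 2) |
                    ((unsigned)!!eom << 1) | (unsigned)!!som);
}

/* Unpack the individual fields again. */
static inline unsigned demo_ctl_orig(uint16_t ctl) { return (unsigned)ctl >> 3; }
static inline int      demo_ctl_som (uint16_t ctl) { return  ctl       & 1; }
static inline int      demo_ctl_eom (uint16_t ctl) { return (ctl >> 1) & 1; }
static inline int      demo_ctl_err (uint16_t ctl) { return (ctl >> 2) & 1; }
```

Three flag bits plus thirteen origin bits fill the 16-bit `ctl` field exactly, which is why the origin count tops out at 8192.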

Sequence Numbers

Properties:

// 64-bit sequence numbers
// - Monotonically increasing across all producers
// - Gaps indicate lost fragments (overrun)
// - Wrapping handled correctly (randomized initial values)

int  fd_seq_lt(ulong a, ulong b);    // Is a before b (wrap-aware)?
int  fd_seq_eq(ulong a, ulong b);    // Are a and b the same sequence number?
long fd_seq_diff(ulong a, ulong b);  // Signed distance from b to a (positive if a is ahead)

Sequence number math:

ulong next = fd_seq_inc(seq, 1UL);      // Increment
ulong prev = fd_seq_dec(seq, 1UL);      // Decrement
long diff = fd_seq_diff(seq_a, seq_b);  // Distance
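
Wrap-aware comparison is usually done by subtracting the two values as unsigned integers and reading the result as signed. The following is a minimal sketch of that technique with hypothetical `demo_` names. It illustrates the idea and is not a copy of Tango's helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Signed distance from b to a: positive when a is ahead of b.
   Meaningful while the two values are less than 2^63 apart. The
   unsigned-to-signed conversion relies on two's complement behaviour,
   which mainstream compilers provide. */
static inline int64_t demo_seq_diff(uint64_t a, uint64_t b) { return (int64_t)(a - b); }

static inline int demo_seq_lt(uint64_t a, uint64_t b) { return demo_seq_diff(a, b) < 0; }
static inline int demo_seq_gt(uint64_t a, uint64_t b) { return demo_seq_diff(a, b) > 0; }

/* Unsigned addition wraps modulo 2^64, so no special case is needed. */
static inline uint64_t demo_seq_inc(uint64_t seq, uint64_t delta) { return seq + delta; }
```

With this scheme a sequence number just past the wrap point (for example 1) still compares as newer than one just before it (for example `UINT64_MAX`), whereas a plain `>` would get it wrong.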

Architecture Components

Fragment Metadata Cache (mcache)

Mcache (`mcache/`)

Hybrid ring buffer and direct-mapped cache for fragment metadata.

Structure:

Ring buffer of fd_frag_meta_t entries
Power-of-2 size (typically 4K-1M entries)
Direct mapping: seq % depth → cache line
Lock-free atomic updates (SSE/AVX instructions)

Access patterns:

// Publisher: Write metadata atomically
fd_frag_meta_t * meta = mcache + (seq % depth);
meta->seq = seq;  // Atomic write (first SSE word)

// Consumer: Read metadata atomically
ulong seq = fd_frag_meta_seq_query(meta);
if (fd_seq_gt(seq, last_seq)) {
  // New fragment available
}
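
Because each sequence number maps to exactly one slot, a consumer can tell a pending fragment from an overrun by comparing the slot's sequence number with the one it expects. The sketch below is a single-threaded illustration with hypothetical `demo_` names and a trimmed metadata struct. It deliberately omits the atomic reads a real consumer needs.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t seq; uint64_t sig; } demo_meta_t;  /* trimmed metadata entry */

enum { DEMO_POLL_READY = 0, DEMO_POLL_EMPTY = 1, DEMO_POLL_OVERRUN = 2 };

/* Map a sequence number to its ring slot; depth must be a power of two. */
static inline uint64_t demo_line_idx(uint64_t seq, uint64_t depth) {
  assert(depth && !(depth & (depth - 1)));
  return seq & (depth - 1);
}

/* Compare the slot's sequence number with the one the consumer expects. */
static inline int demo_poll(const demo_meta_t *mcache, uint64_t depth, uint64_t expected) {
  uint64_t found = mcache[demo_line_idx(expected, depth)].seq;
  int64_t  diff  = (int64_t)(found - expected);
  if (diff == 0) return DEMO_POLL_READY;    /* the fragment we want is here */
  if (diff <  0) return DEMO_POLL_EMPTY;    /* slot still holds an older fragment */
  return DEMO_POLL_OVERRUN;                 /* producer lapped us; fragments were lost */
}
```

On `DEMO_POLL_OVERRUN` the consumer would typically skip ahead to the sequence number it found and record how many fragments it missed.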

Depth sizing:

Small queues: 1K-4K entries (low latency)
Deep queues: 256K-1M entries (burst absorption)
Trade-off: Memory vs. overrun tolerance

Memory footprint:

Each entry: 32 bytes (FD_FRAG_META_SZ)
Aligned to 32 bytes (FD_FRAG_META_ALIGN)
Total: depth * 32 bytes

Data Cache (dcache)

Dcache (`dcache/`)

Payload storage with chunk-granular allocation.

Chunk system:

#define FD_CHUNK_SZ     (64UL)  // 64-byte chunks
#define FD_CHUNK_LG_SZ  (6)     // log2(64) = 6

Chunk addressing:

// Convert chunk ID to pointer
void * payload = fd_chunk_to_laddr(chunk0, chunk_id);

// Convert pointer to chunk ID
ulong chunk_id = fd_laddr_to_chunk(chunk0, payload);

Properties:

32-bit chunk IDs (4 billion chunks max)
Chunk 0 is base address
All chunks 64-byte aligned
Max data region: 256 GiB per dcache (2^32 chunks × 64 bytes)
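
The chunk arithmetic comes down to shifts by the chunk size's log2. Below is a minimal sketch assuming 64-byte chunks as listed above. The `demo_` helpers are hypothetical stand-ins and may differ from the real conversion functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CHUNK_LG_SZ 6
#define DEMO_CHUNK_SZ    (1UL << DEMO_CHUNK_LG_SZ)   /* 64 bytes */

/* Chunk index -> local address, relative to the base address of chunk 0. */
static inline void *demo_chunk_to_laddr(void *chunk0, uint32_t chunk) {
  return (unsigned char *)chunk0 + ((size_t)chunk << DEMO_CHUNK_LG_SZ);
}

/* Local address -> chunk index; laddr must be chunk aligned and not below chunk0. */
static inline uint32_t demo_laddr_to_chunk(void *chunk0, void *laddr) {
  size_t off = (size_t)((unsigned char *)laddr - (unsigned char *)chunk0);
  assert((off & (DEMO_CHUNK_SZ - 1)) == 0);
  return (uint32_t)(off >> DEMO_CHUNK_LG_SZ);
}

/* Number of chunks a payload of sz bytes occupies, rounded up. */
static inline uint32_t demo_chunk_cnt(size_t sz) {
  return (uint32_t)((sz + DEMO_CHUNK_SZ - 1) >> DEMO_CHUNK_LG_SZ);
}
```

Storing a 32-bit chunk index instead of a 64-bit pointer is what lets the payload location fit in the `chunk` field of the 32-byte metadata entry, and it keeps the reference valid across processes that map the region at different addresses.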

Allocation strategies:

Linear allocation (sequential writes)
Circular buffer (fixed-size messages)
Custom allocators (variable-size)

Flow Control (fctl)

Fctl (`fctl/`)

Credit-based flow control for backpressure.

Mechanism:

Consumer publishes credits (“I can receive N fragments”)
Producer consumes credits before sending
Prevents overrun of slow consumers
Dynamic credit replenishment

Credit types:

Fragment credits: Number of fragments consumer can receive
Chunk credits: Amount of dcache space available

Usage:

// Producer checks credits
if (fctl_credits_avail >= 1) {
  // Send fragment
  fctl_credits_avail--;
}

// Consumer replenishes credits
fctl_publish_credits(processed_count);
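
One way to derive fragment credits is to subtract the number of fragments still in flight to the slowest consumer from the ring depth. The sketch below assumes each consumer advertises the next sequence number it will process. The function name and parameters are hypothetical and do not reflect the fctl API.

```c
#include <assert.h>
#include <stdint.h>

/* Credits available to a producer: ring depth minus the fragments the
   slowest consumer has not processed yet. Unsigned subtraction keeps the
   arithmetic correct across sequence number wrap. */
static inline uint64_t demo_credits(uint64_t depth, uint64_t prod_seq,
                                    const uint64_t *cons_seq, unsigned cons_cnt) {
  uint64_t max_in_flight = 0;
  for (unsigned i = 0; i < cons_cnt; i++) {
    uint64_t in_flight = prod_seq - cons_seq[i];   /* backlog of consumer i */
    if (in_flight > max_in_flight) max_in_flight = in_flight;
  }
  return max_in_flight >= depth ? 0 : depth - max_in_flight;
}
```

A producer that stops publishing when this value reaches zero can never overwrite a slot that a tracked consumer has yet to read.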

Sequence Tracker (fseq)

Fseq (`fseq/`)

Shared sequence number for synchronization.

Purpose:

Publisher advertises latest sequence number
Multiple consumers read the same fseq
Enables “pub-sub” patterns
Flow control signal publishing

Memory layout:

// Single 64-bit atomic sequence number
// Typically in shared memory (wksp)
// Cache-line aligned to avoid false sharing

Operations:

// Publisher: Update sequence
fd_fseq_update(fseq, new_seq);

// Consumer: Query latest
ulong latest = fd_fseq_query(fseq);
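
The layout described above (one 64-bit value alone on a cache line) can be sketched with C11 atomics. This is an illustration under that assumption, using hypothetical `demo_` names. Tango's own implementation does not necessarily use `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

/* One shared sequence number, padded out to its own 64-byte cache line so
   that neighbouring data cannot cause false sharing. */
typedef struct {
  alignas(64) _Atomic uint64_t seq;
} demo_fseq_t;

/* Writer side: make the new sequence number visible to readers. */
static inline void demo_fseq_update(demo_fseq_t *f, uint64_t seq) {
  atomic_store_explicit(&f->seq, seq, memory_order_release);
}

/* Reader side: fetch the most recently published sequence number. */
static inline uint64_t demo_fseq_query(demo_fseq_t *f) {
  return atomic_load_explicit(&f->seq, memory_order_acquire);
}
```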

Transaction Cache (tcache)

Tcache (`tcache/`)

Specialized cache for transaction metadata.

Features:

Transaction-specific metadata storage
Deduplication support
Tag-based lookup
Recent transaction tracking

Use case:

Disco dedup tile uses tcache
Maps transaction signature → metadata
Fast duplicate detection
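
Duplicate detection over a window of recent tags can be sketched as a ring that remembers the last N tags and evicts the oldest. The version below uses a linear scan to stay short. The `demo_` names are hypothetical, and a production cache would presumably pair the ring with a hash map for constant-time lookup.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_TCACHE_DEPTH 4u   /* how many recent tags to remember */

typedef struct {
  uint64_t ring[DEMO_TCACHE_DEPTH];  /* recent tags; the oldest is overwritten first */
  unsigned used;                     /* number of valid entries */
  unsigned next;                     /* slot the next new tag goes into */
} demo_tcache_t;

/* Return 1 if tag was seen recently. Otherwise remember it, evicting the
   oldest entry when the ring is full, and return 0. */
static inline int demo_tcache_insert(demo_tcache_t *tc, uint64_t tag) {
  for (unsigned i = 0; i < tc->used; i++)
    if (tc->ring[i] == tag) return 1;
  tc->ring[tc->next] = tag;
  tc->next = (tc->next + 1u) % DEMO_TCACHE_DEPTH;
  if (tc->used < DEMO_TCACHE_DEPTH) tc->used++;
  return 0;
}
```

The depth bounds memory use and defines how far back duplicates are caught: a tag that has been evicted is treated as new again.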

Command & Control (CNC)

CNC (`cnc/`)

Tile lifecycle and health management.

Functions:

Tile start/stop signaling
Heartbeat monitoring
State machine coordination
Diagnostic queries

State transitions:

BOOT → INIT → RUNNING → HALT → BOOT
         ↓       ↓
      ERROR    ERROR

Heartbeat:

Periodic “I’m alive” signal
Timeout detection
Automatic restart on failure
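
The state diagram and the heartbeat timeout can be expressed as two small predicates. The sketch below encodes only the transitions drawn in the diagram above and treats ERROR as terminal, which is an assumption. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { DEMO_BOOT, DEMO_INIT, DEMO_RUNNING, DEMO_HALT, DEMO_ERROR } demo_state_t;

/* Is from -> to one of the transitions shown in the state diagram? */
static inline int demo_transition_ok(demo_state_t from, demo_state_t to) {
  switch (from) {
    case DEMO_BOOT:    return to == DEMO_INIT;
    case DEMO_INIT:    return to == DEMO_RUNNING || to == DEMO_ERROR;
    case DEMO_RUNNING: return to == DEMO_HALT    || to == DEMO_ERROR;
    case DEMO_HALT:    return to == DEMO_BOOT;
    default:           return 0;   /* ERROR has no outgoing edge in this sketch */
  }
}

/* A tile counts as stale once its last heartbeat is older than timeout_ns. */
static inline int demo_heartbeat_stale(int64_t now_ns, int64_t last_hb_ns, int64_t timeout_ns) {
  return (now_ns - last_hb_ns) > timeout_ns;
}
```

A supervisor would poll `demo_heartbeat_stale` for each tile and trigger whatever restart policy applies when it returns nonzero.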

Timing (tempo)

Tempo (`tempo/`)

Wallclock and timing utilities.

Features:

Synchronized wallclock access
Timestamp compression/decompression
Temporal comparisons
Nanosecond precision

Timestamp compression:

// Compress 64-bit timestamp to 32 bits
ulong ts_comp = fd_frag_meta_ts_comp(ts);

// Decompress using reference timestamp
long ts = fd_frag_meta_ts_decomp(ts_comp, ts_ref);

Accuracy:

Unambiguous within about ±2.1 seconds of the reference timestamp (32 bits of nanoseconds span about 4.3 seconds)
Suitable for latency measurements
No clock synchronization required
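
The compression scheme can be sketched as keeping the low 32 bits of a nanosecond timestamp and later choosing the full value closest to a reference. The `demo_` functions below are a hypothetical illustration of that idea, assuming nanosecond units. They are not the library's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low 32 bits of a nanosecond timestamp. */
static inline uint32_t demo_ts_comp(int64_t ts) { return (uint32_t)(uint64_t)ts; }

/* Recover the full timestamp nearest to ref whose low 32 bits equal comp.
   Exact when ts lies within [-2^31, 2^31) ns of ref, about 2.1 s either way. */
static inline int64_t demo_ts_decomp(uint32_t comp, int64_t ref) {
  uint32_t delta = comp - (uint32_t)(uint64_t)ref;        /* (ts - ref) mod 2^32 */
  int64_t  d     = (int64_t)delta;
  if (d >= (INT64_C(1) << 31)) d -= (INT64_C(1) << 32);   /* fold into [-2^31, 2^31) */
  return ref + d;
}
```

If the reference drifts more than about 2.1 s from the original timestamp, the result is off by a multiple of 2^32 ns, so the reference should be a recent local clock reading.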

Message Reassembly

Single-Fragment Messages

Simplest case (most common):

ctl = fd_frag_meta_ctl(origin, /*som*/1, /*eom*/1, /*err*/0);
// SOM and EOM both set → complete message in one fragment

Multi-Fragment Messages

// First fragment
ctl = fd_frag_meta_ctl(origin, /*som*/1, /*eom*/0, /*err*/0);

// Middle fragments
ctl = fd_frag_meta_ctl(origin, /*som*/0, /*eom*/0, /*err*/0);

// Last fragment
ctl = fd_frag_meta_ctl(origin, /*som*/0, /*eom*/1, /*err*/0);

Reassembly logic:

Wait for SOM fragment from origin
Accumulate fragments until EOM
Check for sequence number gaps (overrun)
If ERR bit set, discard entire message
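
The reassembly steps above can be sketched as a small state machine fed one fragment at a time. The version below tracks a single origin, uses a fixed-size buffer, and has hypothetical names throughout. It is one possible reading of the logic rather than the code a Tango consumer actually runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MSG_MAX 4096u

enum { DEMO_MORE = 0, DEMO_DONE = 1, DEMO_DROP = 2 };

typedef struct {
  uint64_t      next_seq;            /* sequence number expected next */
  int           in_msg;              /* 1 while between SOM and EOM */
  size_t        len;                 /* bytes accumulated so far */
  unsigned char buf[DEMO_MSG_MAX];   /* reassembled message */
} demo_reasm_t;

/* Feed one fragment. Returns DEMO_DONE when buf[0..len) holds a complete
   message, DEMO_MORE when more fragments are needed, and DEMO_DROP when
   the current message had to be discarded. */
static inline int demo_reasm_frag(demo_reasm_t *r, uint64_t seq, int som, int eom, int err,
                                  const void *payload, size_t sz) {
  int gap = (seq != r->next_seq);                 /* a gap means fragments were lost */
  r->next_seq = seq + 1;
  if (som) { r->in_msg = 1; r->len = 0; }         /* SOM always starts a fresh message */
  else if (gap || !r->in_msg) { r->in_msg = 0; return DEMO_DROP; }
  if (err || r->len + sz > DEMO_MSG_MAX) { r->in_msg = 0; return DEMO_DROP; }
  if (sz) memcpy(r->buf + r->len, payload, sz);
  r->len += sz;
  if (!eom) return DEMO_MORE;
  r->in_msg = 0;
  return DEMO_DONE;
}
```

After a drop the state machine ignores fragments until the next SOM, so one lost fragment costs at most one message.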

Signature Filtering

Consumers can filter messages without reading payloads:

// Example: Route by protocol type
ulong sig = fd_disco_netmux_sig(hash, port, ip, proto, hdr_sz);

if (fd_disco_netmux_sig_proto(sig) == DST_PROTO_TPU_QUIC) {
  // Handle QUIC packets
} else {
  // Skip without reading dcache
}
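
To show why filtering on `sig` avoids touching payloads, the sketch below packs a protocol ID into the low bits of a 64-bit signature and filters on it using metadata alone. The bit layout and the `demo_` names are hypothetical. The real netmux signature encodes more fields and may be arranged differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low 8 bits = protocol ID, upper 56 bits = flow hash. */
static inline uint64_t demo_sig(uint64_t hash, unsigned proto) {
  return (hash << 8) | (uint64_t)(proto & 0xFFu);
}
static inline unsigned demo_sig_proto(uint64_t sig) { return (unsigned)(sig & 0xFFu); }
static inline uint64_t demo_sig_hash (uint64_t sig) { return sig >> 8; }

/* Count the fragments a consumer would process, looking only at signatures. */
static inline unsigned demo_count_matching(const uint64_t *sigs, unsigned cnt,
                                           unsigned want_proto) {
  unsigned hits = 0;
  for (unsigned i = 0; i < cnt; i++)
    if (demo_sig_proto(sigs[i]) == want_proto) hits++;   /* payload is never read */
  return hits;
}
```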

Performance Characteristics

Atomic Operations

Tango uses x86 atomic instructions for lock-free updates:

// SSE (16-byte atomic)
__m128i sse0 = fd_frag_meta_seq_sig_query(meta);
// Reads seq and sig atomically (AVX CPUs)

// AVX (32-byte, not guaranteed atomic)
__m256i avx = meta->avx;
// Can hold entire metadata in one register

Memory Ordering

// Compiler fence (not CPU fence)
FD_COMPILER_MFENCE();

// Ensures compiler doesn't reorder loads/stores
// No runtime cost on x86 (TSO memory model)

Benchmarks (typical)

Latency: 20-50 ns (mcache read)
Throughput: 100M+ fragments/sec (single producer)
Scalability: Linear with # of producers
Overrun detection: Zero overhead

Constants & Limits

// Fragment metadata
#define FD_FRAG_META_ALIGN     (32UL)
#define FD_FRAG_META_SZ        (32UL)
#define FD_FRAG_META_ORIG_MAX  (8192UL)

// Chunk allocation
#define FD_CHUNK_ALIGN  (64UL)
#define FD_CHUNK_SZ     (64UL)

// Fragment size limits
#define FD_FRAG_SZ_MAX  (65535UL)  // 16-bit size field

Usage Example

// Publish a fragment
fd_frag_meta_t * meta = mcache + (seq % depth);
void * payload = fd_chunk_to_laddr(chunk0, chunk);

// Write payload
memcpy(payload, data, data_sz);

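// Keep the compiler from moving the payload write past the metadata stores
FD_COMPILER_MFENCE();

// Fill in the metadata. The field-by-field stores are shown for clarity
// only: they are not atomic as a group, so a production publisher must make
// the entry visible in a way that consumers never observe it half-written.
meta->seq = seq;
meta->sig = signature;
meta->chunk = chunk;
meta->sz = (ushort)data_sz;  // data_sz must not exceed FD_FRAG_SZ_MAX
meta->ctl = fd_frag_meta_ctl(origin, som, eom, err);
meta->tsorig = (uint)fd_frag_meta_ts_comp(tsorig);
meta->tspub = (uint)fd_frag_meta_ts_comp(tspub);

seq = fd_seq_inc(seq, 1UL);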

Workspace Integration

Tango objects live in Firedancer workspaces (wksp):

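// Create mcache in workspace
// (argument lists follow the Firedancer headers; verify against your checkout)
fd_wksp_t * wksp = fd_wksp_attach("my_wksp");
ulong app_sz = 0UL;  // no application-specific region in the mcache
ulong seq0   = 0UL;  // initial sequence number
ulong tag    = 1UL;  // workspace allocation tag (illustrative value)
void * mem = fd_wksp_alloc_laddr(wksp, fd_mcache_align(), fd_mcache_footprint(depth, app_sz), tag);
fd_frag_meta_t * mcache = fd_mcache_join(fd_mcache_new(mem, depth, app_sz, seq0));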

Benefits:

Huge page backed (2MB/1GB pages)
NUMA-aware allocation
Persistent across processes
Named objects for discovery
Standard UNIX permissions

Control Utility

Tango provides fd_tango_ctl for management and inspection:

# Inspect mcache
fd_tango_ctl mcache query <wksp>:<name>

# Monitor fseq
fd_tango_ctl fseq monitor <wksp>:<name>

# Check CNC health
fd_tango_ctl cnc status <wksp>:<name>

Header Files

#include "tango/fd_tango.h"       // Main header
#include "tango/fd_tango_base.h"  // Base types
#include "tango/mcache/fd_mcache.h"
#include "tango/dcache/fd_dcache.h"
#include "tango/fctl/fd_fctl.h"
#include "tango/fseq/fd_fseq.h"
#include "tango/cnc/fd_cnc.h"

Related Components

Disco - Uses Tango for tile communication
Util - Workspace management (wksp)
