Overview

Oboromi’s GPU emulation layer translates NVIDIA SM86 (Ampere) shader instructions (SASS) into SPIR-V for execution on Vulkan-compatible hardware. This lets guest shaders run on the host GPU without requiring NVIDIA hardware.

Architecture

The GPU emulation consists of three main components:
  1. SM86 Decoder - Parses 128-bit SASS instructions into structured data
  2. SPIR-V Emitter - Generates valid SPIR-V binary modules
  3. Vulkan State - Manages Vulkan instance and execution context
┌─────────────┐      ┌──────────────┐      ┌─────────────┐
│ SASS Binary │─────▶│ SM86 Decoder │─────▶│   SPIR-V    │
│  (128-bit)  │      │   (Rust)     │      │   Emitter   │
└─────────────┘      └──────────────┘      └──────┬──────┘
                                                  │
                                                  ▼
                                           ┌─────────────┐
                                           │   Vulkan    │
                                           │   Runtime   │
                                           └─────────────┘

SM86 Instruction Decoder

The decoder (defined in core/src/gpu/sm86.rs) is a code-generated parser for NVIDIA’s SASS instruction format.

Decoder Structure

pub struct Decoder<'a> {
    pub ir: &'a mut spirv::Emitter,
    type_void: u32,
    type_ptr_u32: u32,
    // Type IDs for different bit widths and components
    type_u8: [u32; 5],
    type_u16: [u32; 5],
    type_u32: [u32; 5],
    type_u64: [u32; 5],
    type_s8: [u32; 5],
    type_s16: [u32; 5],
    type_s32: [u32; 5],
    type_s64: [u32; 5],
    type_f16: [u32; 5],
    type_f32: [u32; 5],
    type_f64: [u32; 5],
    type_bool: [u32; 5],
    // Virtual register file (254 registers)
    regs: [u32; MAX_REG_COUNT],  // MAX_REG_COUNT = 254
}

Initialization

The decoder pre-allocates all SPIR-V types and registers:
pub fn init(&mut self) {
    self.type_void = self.ir.emit_type_void();
    self.type_u8[1] = self.ir.emit_type_int(8, 0);
    self.type_u16[1] = self.ir.emit_type_int(16, 0);
    self.type_u32[1] = self.ir.emit_type_int(32, 0);
    self.type_u64[1] = self.ir.emit_type_int(64, 0);
    self.type_f32[1] = self.ir.emit_type_float(32);
    // ... more type declarations ...

    // Create vector types (2, 3, 4 components) from each scalar type.
    // Index 1 holds the scalar, so it is the component type; the new
    // vector type ID is stored at index i.
    for i in 2..=4 {
        for type_sxx in [
            &mut self.type_u8, &mut self.type_u16, &mut self.type_u32, &mut self.type_u64,
            &mut self.type_s8, &mut self.type_s16, &mut self.type_s32, &mut self.type_s64,
            &mut self.type_f16, &mut self.type_f32, &mut self.type_f64, &mut self.type_bool,
        ] {
            type_sxx[i] = self.ir.emit_type_vector(type_sxx[1], i as u32);
        }
    }

    // Define the register pointer type (storage class 7 = Function)
    self.type_ptr_u32 = self.ir.emit_type_pointer(7, self.type_u32[1]);

    // Allocate register variables
    for r in self.regs.iter_mut() {
        *r = self.ir.emit_variable(self.type_ptr_u32, 7);
    }
}
Index 0 of each type array is unused: index 1 holds the scalar type, and indices 2-4 hold the 2-, 3-, and 4-component vector types.

Register File Operations

Zero Register (RZ)

Register 255 is the special zero register:
fn load_reg(&mut self, reg: usize) -> u32 {
    if reg == 255 {
        // RZ (Zero Register) - always reads as 0
        return self.ir.emit_constant_typed(self.type_u32[1], 0u32);
    }
    assert!(reg < self.regs.len(), "Register index out of bounds");
    let ptr = self.regs[reg];
    self.ir.emit_load(self.type_u32[1], ptr)
}

fn store_reg(&mut self, reg: usize, val: u32) {
    if reg == 255 {
        // Write to RZ is ignored
        return;
    }
    assert!(reg < self.regs.len(), "Register index out of bounds");
    let ptr = self.regs[reg];
    self.ir.emit_store(ptr, val);
}
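
The RZ convention can be illustrated without any SPIR-V machinery. The sketch below is a hypothetical host-side model: the `RegFile` type and its methods are assumptions made for illustration and are not part of Oboromi. It only mirrors the read-as-zero, ignore-writes behaviour of load_reg and store_reg above, using plain values instead of SPIR-V IDs.

```rust
/// Index of the zero register, as in the decoder above.
const RZ: usize = 255;
/// Number of backing registers, matching MAX_REG_COUNT in the decoder.
const MAX_REG_COUNT: usize = 254;

/// Hypothetical plain-value register file, used only for illustration.
pub struct RegFile {
    regs: [u32; MAX_REG_COUNT],
}

impl RegFile {
    pub fn new() -> Self {
        Self { regs: [0; MAX_REG_COUNT] }
    }

    /// Reads a register; RZ always yields 0.
    pub fn load(&self, reg: usize) -> u32 {
        if reg == RZ {
            return 0;
        }
        assert!(reg < self.regs.len(), "Register index out of bounds");
        self.regs[reg]
    }

    /// Writes a register; writes to RZ are silently dropped.
    pub fn store(&mut self, reg: usize, val: u32) {
        if reg == RZ {
            return;
        }
        assert!(reg < self.regs.len(), "Register index out of bounds");
        self.regs[reg] = val;
    }
}
```

In this model, index 254 is neither RZ nor a backed register, so it trips the bounds assertion rather than reading as zero.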

Instruction Example: AL2P

The Address Load 2 Pointer instruction demonstrates the decoding process:
// %rd := %ra + $ra_offset
pub fn al2p(&mut self, inst: u128) {
    let _pg = (((inst >> 12) & 0x7) << 0);           // Predicate guard
    let _pg_not = (((inst >> 15) & 0x1) << 0);       // Predicate negate
    let rd = (((inst >> 16) & 0xff) << 0) as usize;  // Destination register
    let ra = (((inst >> 24) & 0xff) << 0) as usize;  // Source register
    let ra_offset = (((inst >> 40) & 0x7ff) << 0) as usize;  // Immediate offset
    let bop = (((inst >> 74) & 0x3) << 0) as usize;  // Bit operation size
    
    assert!(ra < MAX_REG_COUNT || ra == 255);  // backed register or RZ
    assert!(bop == BitSize::B32 as usize);
    
    // Load source register value
    let base = self.load_reg(ra);
    
    // Create constant for offset
    let offset = self.ir.emit_constant_typed(self.type_u32[1], ra_offset as u32);
    
    // Emit integer addition: dst = base + offset
    let dst_val = self.ir.emit_iadd(self.type_u32[1], base, offset);
    
    // Store to destination register
    self.store_reg(rd, dst_val);
}
Bit Field Extraction: The instruction decoding uses bit manipulation to extract fields from the 128-bit instruction word:
// Pattern: (((inst >> shift) & mask) << output_shift)
let rd = (((inst >> 16) & 0xff) << 0) as usize;
//          └──shift─┘   └mask┘  └─0─┘  └─cast─┘
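
The shift-and-mask pattern can be factored into a small helper. The following is a minimal sketch: `extract_bits` and `place_bits` are hypothetical names introduced here, not functions from the generated decoder, which inlines the expression for each field.

```rust
/// Extracts `width` bits starting at bit `shift` from a 128-bit instruction word.
/// Hypothetical helper; `width` must be in 1..=127.
pub fn extract_bits(inst: u128, shift: u32, width: u32) -> u128 {
    debug_assert!(width >= 1 && width < 128 && shift < 128);
    // A mask of `width` low bits, e.g. width 8 gives 0xff, width 11 gives 0x7ff.
    (inst >> shift) & ((1u128 << width) - 1)
}

/// Builds part of a test word by placing `value` at bit `shift` (also hypothetical).
pub fn place_bits(value: u128, shift: u32) -> u128 {
    value << shift
}
```

With this helper, the destination register field from the AL2P example would read `extract_bits(inst, 16, 8) as usize`, and the 11-bit immediate offset `extract_bits(inst, 40, 11) as usize`.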

SPIR-V Generation

The SPIR-V emitter (core/src/gpu/spirv.rs) provides a safe Rust API for generating SPIR-V binary modules.

Core Emitter

pub struct Emitter {
    words: Vec<u32>,       // Output SPIR-V word stream
    next_id: u32,          // Next available ID (1-based)
    bound_idx: usize,      // Index of ID bound in header
}

impl Emitter {
    pub fn new() -> Self {
        Self {
            words: Vec::with_capacity(4096),
            next_id: 1,  // SPIR-V IDs are 1-based; 0 is reserved
            bound_idx: 0,
        }
    }

    #[inline]
    pub fn alloc_id(&mut self) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        id
    }
}

Header Emission

pub fn emit_header(&mut self) {
    self.words.push(0x07230203);    // magic
    self.words.push(0x00010500);    // version 1.5
    self.words.push(0);             // generator (unregistered)
    self.bound_idx = self.words.len();
    self.words.push(0);             // bound (patched by finalize)
    self.words.push(0);             // schema
}

pub fn finalize(&mut self) {
    if self.bound_idx < self.words.len() {
        self.words[self.bound_idx] = self.next_id;
    }
}
Always call finalize() after emitting all instructions to patch the ID bound in the header.
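
To show what finalize() patches, the sketch below reimplements only the header logic in a hypothetical `MiniEmitter`. It is a stand-in written for this example, not the Oboromi type, and `header_with_ids` is likewise an assumed helper. SPIR-V requires the bound to be greater than every ID used in the module, which is why `next_id` (one past the last allocated ID) is the value written.

```rust
/// Minimal stand-in for the emitter above, reduced to header handling.
pub struct MiniEmitter {
    words: Vec<u32>,
    next_id: u32,
    bound_idx: usize,
}

impl MiniEmitter {
    pub fn new() -> Self {
        Self { words: Vec::new(), next_id: 1, bound_idx: 0 }
    }

    pub fn alloc_id(&mut self) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        id
    }

    pub fn emit_header(&mut self) {
        self.words.push(0x0723_0203); // magic
        self.words.push(0x0001_0500); // version 1.5
        self.words.push(0);           // generator
        self.bound_idx = self.words.len();
        self.words.push(0);           // bound placeholder
        self.words.push(0);           // schema
    }

    /// Writes the final ID bound into the placeholder slot.
    pub fn finalize(&mut self) {
        if self.bound_idx < self.words.len() {
            self.words[self.bound_idx] = self.next_id;
        }
    }

    pub fn words(&self) -> &[u32] {
        &self.words
    }
}

/// Emits a header, allocates `n` IDs, and returns the finished word stream.
pub fn header_with_ids(n: u32) -> Vec<u32> {
    let mut e = MiniEmitter::new();
    e.emit_header();
    for _ in 0..n {
        e.alloc_id();
    }
    e.finalize();
    e.words().to_vec()
}
```

After allocating IDs 1 through 3, word 3 of the header holds 4. Without the call to finalize() it would remain 0, which validators reject.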

Type System

// Integers: width = 8|16|32|64, sign = 0 (unsigned) or 1 (signed)
pub fn emit_type_int(&mut self, width: u32, sign: u32) -> u32 {
    debug_assert!(width == 8 || width == 16 || width == 32 || width == 64);
    debug_assert!(sign <= 1);
    let r = self.alloc_id();
    self.inst(21, &[r, width, sign]);
    r
}

// Floats: width = 16|32|64
pub fn emit_type_float(&mut self, width: u32) -> u32 {
    debug_assert!(width == 16 || width == 32 || width == 64);
    let r = self.alloc_id();
    self.inst(22, &[r, width]);
    r
}

// Booleans
pub fn emit_type_bool(&mut self) -> u32 {
    let r = self.alloc_id();
    self.inst(20, &[r]);
    r
}

Arithmetic Operations

// Integer operations
pub fn emit_iadd(&mut self, ty: u32, a: u32, b: u32) -> u32 { 
    self.typed_bin(128, ty, a, b) 
}
pub fn emit_isub(&mut self, ty: u32, a: u32, b: u32) -> u32 { 
    self.typed_bin(130, ty, a, b) 
}
pub fn emit_imul(&mut self, ty: u32, a: u32, b: u32) -> u32 { 
    self.typed_bin(132, ty, a, b) 
}

// Float operations
pub fn emit_fadd(&mut self, ty: u32, a: u32, b: u32) -> u32 { 
    self.typed_bin(129, ty, a, b) 
}
pub fn emit_fmul(&mut self, ty: u32, a: u32, b: u32) -> u32 { 
    self.typed_bin(133, ty, a, b) 
}

// Helper for binary operations
fn typed_bin(&mut self, op: u32, ty: u32, a: u32, b: u32) -> u32 {
    let r = self.alloc_id();
    self.inst(op, &[ty, r, a, b]);
    r
}
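
The body of the inst helper is not shown on this page. The sketch below assumes it follows the standard SPIR-V binary encoding, in which the first word of every instruction packs the total word count into the high 16 bits and the opcode into the low 16 bits. `encode_inst` and `encode_iadd` are hypothetical free functions used for illustration.

```rust
/// Encodes one SPIR-V instruction: the first word packs the total word count
/// (high 16 bits) and the opcode (low 16 bits), followed by the operands.
pub fn encode_inst(words: &mut Vec<u32>, opcode: u32, operands: &[u32]) {
    // The count includes the leading opcode word itself.
    let word_count = operands.len() as u32 + 1;
    debug_assert!(opcode <= 0xffff && word_count <= 0xffff);
    words.push((word_count << 16) | opcode);
    words.extend_from_slice(operands);
}

/// Encodes OpIAdd (opcode 128) as: result type, result ID, operand a, operand b.
pub fn encode_iadd(words: &mut Vec<u32>, ty: u32, result: u32, a: u32, b: u32) {
    encode_inst(words, 128, &[ty, result, a, b]);
}
```

An OpIAdd therefore occupies five words, and its first word is 0x00050080. The operand order matches typed_bin above: result type first, then the result ID.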

Memory Operations

pub fn emit_variable(&mut self, ty: u32, storage_class: u32) -> u32 {
    let r = self.alloc_id();
    self.inst(59, &[ty, r, storage_class]);
    r
}

pub fn emit_variable_init(&mut self, ty: u32, storage_class: u32, initializer: u32) -> u32 {
    let r = self.alloc_id();
    self.inst(59, &[ty, r, storage_class, initializer]);
    r
}

Control Flow

// Basic blocks
pub fn emit_label(&mut self) -> u32 {
    let r = self.alloc_id();
    self.inst(248, &[r]);
    r
}

// Branching
pub fn emit_branch(&mut self, target: u32) {
    self.inst(249, &[target]);
}

pub fn emit_branch_conditional(&mut self, cond: u32, true_label: u32, false_label: u32) {
    self.inst(250, &[cond, true_label, false_label]);
}

// Function termination
pub fn emit_return(&mut self) {
    self.inst(253, &[]);
}

pub fn emit_return_value(&mut self, value: u32) {
    self.inst(254, &[value]);
}

Vulkan Integration

The GPU state holds the Vulkan loader entry point and the instance created from it:
use ash::vk;

pub struct VkState {
    pub entry: ash::Entry,
    pub instance: ash::Instance,
}

impl VkState {
    pub fn init(&mut self) -> ash::prelude::VkResult<()> {
        self.entry = unsafe { ash::Entry::load().unwrap() };
        self.instance = unsafe {
            self.entry.create_instance(&vk::InstanceCreateInfo {
                p_application_info: &vk::ApplicationInfo {
                    api_version: vk::make_api_version(0, 1, 0, 0),
                    ..Default::default()
                },
                ..Default::default()
            }, None)?
        };
        Ok(())
    }
}

Texture Formats

The decoder defines enumerations for surface formats and texture types:
#[repr(u32)]
enum SurfaceFormat {
    RGBA32_FLOAT = 0x00c0,
    RGBA32_SINT = 0x00c1,
    RGBA32_UINT = 0x00c2,
    RGBA16_FLOAT = 0x00ca,
    RGBA8_UNORM = 0x00d5,
    RGBA8_SRGB = 0x00d6,
    RG16_FLOAT = 0x00de,
    R32_FLOAT = 0x00e5,
    R8_UNORM = 0x00f3,
    // ... 40+ formats total ...
}

enum TextureType {
    ONE_D = 0,
    TWO_D = 1,
    THREE_D = 2,
    CUBEMAP = 3,
    ONE_D_ARRAY = 4,
    TWO_D_ARRAY = 5,
    ONE_D_BUFFER = 6,
    TWO_D_NO_MIPMAP = 7,
    CUBE_ARRAY = 8,
}
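
Raw field values decoded from an instruction or descriptor need to be checked before they are treated as one of these variants. The following is a minimal sketch of such a conversion for TextureType, using the discriminants listed above. The `TryFrom` implementation and the derives are assumptions for illustration; the source may convert values differently.

```rust
use std::convert::TryFrom;

#[allow(non_camel_case_types)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u32)]
pub enum TextureType {
    ONE_D = 0,
    TWO_D = 1,
    THREE_D = 2,
    CUBEMAP = 3,
    ONE_D_ARRAY = 4,
    TWO_D_ARRAY = 5,
    ONE_D_BUFFER = 6,
    TWO_D_NO_MIPMAP = 7,
    CUBE_ARRAY = 8,
}

impl TryFrom<u32> for TextureType {
    /// The unrecognized raw value is returned as the error.
    type Error = u32;

    fn try_from(raw: u32) -> Result<Self, Self::Error> {
        Ok(match raw {
            0 => TextureType::ONE_D,
            1 => TextureType::TWO_D,
            2 => TextureType::THREE_D,
            3 => TextureType::CUBEMAP,
            4 => TextureType::ONE_D_ARRAY,
            5 => TextureType::TWO_D_ARRAY,
            6 => TextureType::ONE_D_BUFFER,
            7 => TextureType::TWO_D_NO_MIPMAP,
            8 => TextureType::CUBE_ARRAY,
            other => return Err(other),
        })
    }
}
```

An explicit match avoids transmuting an unchecked integer into the enum, which would be undefined behaviour for values outside 0-8.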

Design Decisions

Why SPIR-V?

Portability

SPIR-V is the standard IR for Vulkan, so translated shaders can run on any Vulkan-capable GPU (NVIDIA, AMD, Intel, and Apple through MoltenVK).

Validation

SPIR-V has well-defined validation rules and mature tooling (spirv-val, spirv-cross).

Optimization

Host driver compilers optimize SPIR-V when pipelines are created, so translated shaders benefit from the same optimizer as shaders written for the host GPU.

Debugging

SPIR-V tools enable shader debugging and analysis without vendor-specific tools.

Translation Challenges

  1. Instruction Set Differences
    • SM86 has 300+ unique instructions
    • Many map to SPIR-V extended instructions (GLSL.std.450)
    • Some require multi-instruction sequences
  2. Register Allocation
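    • SM86: up to 255 general-purpose registers per thread (R0-R254), plus RZ at index 255
    • SPIR-V: no registers, only an unlimited supply of SSA result IDs
    • Current approach: pre-allocate 254 SPIR-V variables (MAX_REG_COUNT), one per register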
  3. Predication
    • SM86 uses per-instruction predicates
    • SPIR-V uses structured control flow
    • Translation requires control flow reconstruction
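
For side-effect-free ALU instructions, one possible lowering avoids branches: compute the result unconditionally, then choose between the new and the old destination value based on the guard. In SPIR-V that choice could be expressed with OpSelect. The sketch below models the idea on plain host values and is not taken from Oboromi. `guard_passes` and `predicated_add` are hypothetical names, and treating predicate index 7 as the always-true PT is an assumption based on the usual SASS convention.

```rust
/// Index of the always-true predicate (PT). Assumed to be 7 here;
/// the source page does not state it.
const PT: usize = 7;

/// Evaluates an instruction guard from the 3-bit predicate index (`pg`)
/// and the negate bit (`pg_not`).
pub fn guard_passes(preds: &[bool; 7], pg: usize, pg_not: bool) -> bool {
    let p = if pg == PT { true } else { preds[pg] };
    // XOR with the negate bit.
    p != pg_not
}

/// Models a predicated `rd = ra + imm` as a select: the old value of `rd`
/// is kept when the guard fails.
pub fn predicated_add(
    preds: &[bool; 7],
    pg: usize,
    pg_not: bool,
    old_rd: u32,
    ra: u32,
    imm: u32,
) -> u32 {
    let computed = ra.wrapping_add(imm);
    if guard_passes(preds, pg, pg_not) { computed } else { old_rd }
}
```

Instructions with side effects, such as stores or branches, cannot be handled this way and still need real control flow.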

Current Limitations

Most instruction handlers are currently stubbed with todo!() macros:
pub fn ald(&mut self, inst: u128) {
    let _pg = (((inst >> 12) & 0x7) << 0);
    let _pg_not = (((inst >> 15) & 0x1) << 0);
    let _rd = (((inst >> 16) & 0xff) << 0);
    // ... field extraction ...
    todo!();
}
Priority instruction implementations:
  • Memory load/store (ALD, AST, ATOM)
  • Arithmetic (FADD, FMUL, FFMA, IADD, IMUL)
  • Control flow (BRA, BRX, CALL, EXIT)
  • Texture operations (TEX, TLD, SUTP)

Source Files

  • SM86 Decoder: core/src/gpu/sm86.rs:1-1178
  • SPIR-V Emitter: core/src/gpu/spirv.rs:1-1184
  • GPU Module: core/src/gpu/mod.rs:1-62
