Overview

ARMeilleure is Ryujinx’s custom-built JIT (Just-In-Time) compiler for ARM CPU emulation. It translates ARM64 (and ARM32) guest code into optimized native x86-64 or ARM64 host code at runtime, providing high-performance CPU emulation.
ARMeilleure uses a multi-stage translation pipeline: Decode → IR Translation → Optimization → Register Allocation → Code Generation

Translation Pipeline

Stage 1: Decoding

The decoder (src/ARMeilleure/Decoders/Decoder.cs) performs recursive guest code analysis:
public static Block[] Decode(IMemoryManager memory, ulong address, 
                             ExecutionMode mode, bool highCq, DecoderMode dMode)
{
    List<Block> blocks = [];
    Queue<Block> workQueue = new();
    Dictionary<ulong, Block> visited = new();
    
    int instructionLimit = highCq ? MaxInstsPerFunction : MaxInstsPerFunctionLowCq;
    // Decode blocks recursively...
}
Key features:
  • Basic block construction: Follows control flow (branches, calls, returns)
  • Function size limits: 2500 instructions (high-CQ) or 500 (low-CQ) to prevent excessive compilation time
  • Multi-block analysis: Handles complex control flow graphs
  • Lazy decoding: Only decodes when execution reaches new code regions
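The work-queue discovery of basic blocks can be sketched in miniature. This is a simplified model over a toy instruction set, not ARMeilleure's decoder: `ToyDecoder`, the `int[]{opcode, branchTarget}` encoding, and the opcode values are all illustrative.

```csharp
using System.Collections.Generic;

// Toy model of work-queue basic-block discovery: each "instruction" is an
// int array { opcode, branchTarget }. Opcode 0 = fall-through, 1 = branch.
public static class ToyDecoder
{
    // Returns the start addresses of every basic block reachable from 'entry',
    // stopping once the instruction limit is hit (as the real decoder does).
    public static List<int> Decode(int[][] code, int entry, int instructionLimit)
    {
        var visited = new HashSet<int>();
        var workQueue = new Queue<int>();
        workQueue.Enqueue(entry);
        int decoded = 0;

        while (workQueue.Count > 0 && decoded < instructionLimit)
        {
            int pc = workQueue.Dequeue();
            if (!visited.Add(pc))
                continue; // Block already decoded.

            // Decode sequentially until a branch terminates the block.
            while (pc < code.Length && decoded < instructionLimit)
            {
                decoded++;
                if (code[pc][0] == 1) // Branch: block ends here.
                {
                    workQueue.Enqueue(code[pc][1]); // Taken path.
                    workQueue.Enqueue(pc + 1);      // Fall-through path.
                    break;
                }
                pc++;
            }
        }

        var blocks = new List<int>(visited);
        blocks.Sort();
        return blocks;
    }
}
```

The instruction limit bounds work per function exactly as the `highCq` limits do in the real decoder.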

Stage 2: IR Translation

Guest instructions are lifted into ARMeilleure’s intermediate representation:
// Operation structure from src/ARMeilleure/IntermediateRepresentation/Operation.cs
internal struct Operation
{
    internal struct Data
    {
        public ushort Instruction;
        public ushort Intrinsic;
        public ushort SourcesCount;
        public ushort DestinationsCount;
        public Operation ListPrevious;
        public Operation ListNext;
        public Operand* Destinations;
        public Operand* Sources;
    }
}
IR characteristics:
  • SSA form support: Static Single Assignment for optimization passes
  • Intrusive linked list: Efficient operation manipulation without allocations
  • Typed operands: I32, I64, FP32, FP64, V128 (SIMD vector)
  • Intrinsics: Hardware-accelerated operations (SIMD, crypto, etc.)
// ARM instruction: ADD X0, X1, X2
// Translates to IR (simplified from the AArch64 ALU emitters):
Operand src1 = GetIntOrZR(context, op.Rn);   // Read X1
Operand src2 = GetIntOrZR(context, op.Rm);   // Read X2
Operand result = context.Add(src1, src2);    // Emit an Add operation
SetIntOrZR(context, op.Rd, result);          // Write X0
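The "intrusive linked list" bullet above is worth making concrete: each node embeds its own prev/next links (as `Operation.Data` does with `ListPrevious`/`ListNext`), so inserting or removing an operation is a few pointer writes with no container-node allocation. This is an illustrative sketch, not the real `Operation` type:

```csharp
// Minimal intrusive doubly-linked list in the spirit of ARMeilleure's IR:
// the node itself carries its links, so list edits never allocate.
public class IntrusiveNode
{
    public string Name;
    public IntrusiveNode Prev;
    public IntrusiveNode Next;

    public IntrusiveNode(string name) => Name = name;

    // Insert 'node' immediately after this one: four pointer writes, O(1).
    public void InsertAfter(IntrusiveNode node)
    {
        node.Prev = this;
        node.Next = Next;
        if (Next != null) Next.Prev = node;
        Next = node;
    }

    // Unlink this node from its list in O(1).
    public void Remove()
    {
        if (Prev != null) Prev.Next = Next;
        if (Next != null) Next.Prev = Prev;
        Prev = Next = null;
    }
}
```

Optimization passes that delete or reorder thousands of operations benefit directly from this layout.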

Stage 3: Optimization Passes

The compiler (src/ARMeilleure/Translation/Compiler.cs) applies optimization passes:
public static CompiledFunction Compile(ControlFlowGraph cfg, 
                                       OperandType[] argTypes,
                                       OperandType retType,
                                       CompilerOptions options,
                                       Architecture target)
{
    CompilerContext cctx = new(cfg, argTypes, retType, options);
    
    if (options.HasFlag(CompilerOptions.Optimize))
    {
        TailMerge.RunPass(cctx);  // Merge duplicate block tails
    }
    
    if (options.HasFlag(CompilerOptions.SsaForm))
    {
        Dominance.FindDominators(cfg);
        Dominance.FindDominanceFrontiers(cfg);
        Ssa.Construct(cfg);  // Convert to SSA form
    }
    
    // Backend-specific code generation
    if (target == Architecture.X64)
    {
        return CodeGen.X86.CodeGenerator.Generate(cctx);
    }
    else if (target == Architecture.Arm64)
    {
        return CodeGen.Arm64.CodeGenerator.Generate(cctx);
    }
    else
    {
        throw new NotImplementedException(target.ToString());
    }
}
Optimization techniques:

SSA construction: converts the IR to Static Single Assignment form for advanced optimizations:
  • Phi node insertion at control flow merge points
  • Def-use chain tracking
  • Enables constant propagation and dead code elimination

Constant folding: evaluates constant expressions at compile time:
// Before: ADD r0, #5, #3
// After:  MOV r0, #8
Implemented in CodeGen/Optimizations/ConstantFolding.cs

Tail merging: merges duplicate code at the end of basic blocks to reduce code size and improve instruction cache efficiency.

Block placement: reorders basic blocks for:
  • Better branch prediction (hot paths fall through)
  • Improved instruction cache locality
  • Reduced branch penalties
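Constant folding of the kind shown in the before/after example can be sketched over a tiny expression tree. The `Expr` type here is hypothetical, far simpler than ARMeilleure's IR, but the pass has the same shape: collapse any operation whose operands are all constants.

```csharp
// Tiny constant folder: an expression is either a constant leaf or a binary op.
public class Expr
{
    public char Op;          // '\0' for a constant leaf, else '+', '-', '*'.
    public long Value;       // Meaningful only when Op == '\0'.
    public Expr Left, Right;

    public static Expr Const(long v) => new Expr { Op = '\0', Value = v };
    public static Expr Bin(char op, Expr l, Expr r) => new Expr { Op = op, Left = l, Right = r };

    public bool IsConst => Op == '\0';

    // Bottom-up fold: children first, then this node if both sides are constant.
    public Expr Fold()
    {
        if (IsConst) return this;
        Expr l = Left.Fold(), r = Right.Fold();
        if (l.IsConst && r.IsConst)
        {
            long v = Op switch
            {
                '+' => l.Value + r.Value,
                '-' => l.Value - r.Value,
                '*' => l.Value * r.Value,
                _ => throw new System.InvalidOperationException(),
            };
            return Const(v); // e.g. (5 + 3) folds to 8, like ADD #5, #3 -> MOV #8.
        }
        return Bin(Op, l, r);
    }
}
```

SSA form makes this more powerful in practice, since each value has exactly one definition to inspect.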

Stage 4: Register Allocation

Two allocation strategies, chosen by compilation tier: a fast hybrid allocator (RegisterAllocators/HybridAllocator.cs) for quick low-CQ compilation, and linear scan for optimized high-CQ builds:
// LinearScanAllocator from RegisterAllocators/LinearScanAllocator.cs
// - Live interval computation
// - Single pass over intervals ordered by start position
// - Greedy register assignment
// - Stack spilling when registers are exhausted
Linear scan characteristics:
  • Near-linear complexity
  • Good allocation quality at modest compilation cost
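A toy linear scan allocator shows the core idea: walk live intervals in order of start position, free registers whose intervals have expired, and spill when none remain. This is a sketch, not ARMeilleure's allocator; intervals are plain `(start, end)` pairs.

```csharp
using System.Collections.Generic;
using System.Linq;

// Toy linear-scan register allocation over (start, end) live intervals.
public static class ToyLinearScan
{
    // Returns one assignment per interval: a register index, or -1 for a spill.
    public static int[] Allocate((int Start, int End)[] intervals, int numRegs)
    {
        int[] result = new int[intervals.Length];
        var free = new Stack<int>(Enumerable.Range(0, numRegs).Reverse());
        // Currently live intervals as (end position, assigned register).
        var active = new List<(int End, int Reg)>();

        foreach (int i in Enumerable.Range(0, intervals.Length)
                                    .OrderBy(i => intervals[i].Start))
        {
            // Expire intervals that ended before this one starts.
            foreach (var a in active.Where(a => a.End < intervals[i].Start).ToList())
            {
                free.Push(a.Reg);
                active.Remove(a);
            }
            if (free.Count == 0)
            {
                result[i] = -1; // No register left: spill to the stack.
            }
            else
            {
                result[i] = free.Pop();
                active.Add((intervals[i].End, result[i]));
            }
        }
        return result;
    }
}
```

The single ordered pass is what keeps compilation overhead low compared to graph-coloring approaches.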

Stage 5: Code Generation

Native machine code generation for host architecture:
// From CodeGen/X86/CodeGenerator.cs (simplified)
public static CompiledFunction Generate(CompilerContext cctx)
{
    ControlFlowGraph cfg = cctx.Cfg;
    
    // ... register allocation and CodeGenContext setup elided ...
    
    // Instruction table maps IR operations to code generators
    foreach (BasicBlock block in cfg.Blocks)
    {
        foreach (Operation operation in block.Operations)
        {
            Action<CodeGenContext, Operation> generator = 
                _instTable[(int)operation.Instruction];
            generator(context, operation);
        }
    }
    
    // Map executable memory and return a callable function pointer
    return compiledFunc.MapWithPointer<GuestFunction>(out nint funcPointer);
}
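The `_instTable` lookup is a classic table-driven dispatch: an array of emitter delegates indexed by opcode. A self-contained sketch (the opcodes, operand format, and string output here are illustrative, standing in for real machine-code emission):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Sketch of table-driven code generation: each IR opcode indexes an array
// of emitter delegates that append "host code" (text, for illustration).
public static class ToyCodeGen
{
    public enum Inst { Add = 0, Sub = 1, Mov = 2 }

    private static readonly Action<StringBuilder, string[]>[] _instTable =
    {
        (asm, ops) => asm.AppendLine($"add {ops[0]}, {ops[1]}"),
        (asm, ops) => asm.AppendLine($"sub {ops[0]}, {ops[1]}"),
        (asm, ops) => asm.AppendLine($"mov {ops[0]}, {ops[1]}"),
    };

    public static string Generate(IEnumerable<(Inst Inst, string[] Operands)> ops)
    {
        var asm = new StringBuilder();
        foreach (var op in ops)
        {
            // Same shape as ARMeilleure's dispatch: look up the generator
            // for this opcode and let it emit the host instruction(s).
            _instTable[(int)op.Inst](asm, op.Operands);
        }
        return asm.ToString();
    }
}
```

Keeping emitters in a flat array makes dispatch a single indexed load per IR operation.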
Backend features:

x86-64 Backend

  • SSE/AVX/AVX-512 SIMD support
  • Hardware AES/SHA acceleration
  • Optimized calling conventions (System V / Windows x64)
  • Efficient stack frame management

ARM64 Backend

  • Native ARM64 code on Apple Silicon / Linux ARM
  • NEON SIMD instructions
  • ARM crypto extensions
  • Minimal translation overhead when guest and host are both ARM64 (no instruction-set conversion needed)

Two-Tier Compilation

ARMeilleure uses adaptive compilation to balance startup time and performance:

Low-CQ (Low Code Quality)

// Fast compilation path
TranslatedFunction func = Translate(address, mode, highCq: false);
// - Minimal optimizations
// - Linear scan register allocation
// - Fast startup
// - Lower runtime performance

High-CQ (High Code Quality)

// Optimization compilation path (background threads)
if (callCount >= 100)
{
    TranslatedFunction func = Translate(address, mode, highCq: true);
    // - Full optimization passes
    // - Advanced register allocation
    // - Slower compilation
    // - Maximum runtime performance
}
Rejit mechanism from src/ARMeilleure/Translation/Translator.cs:479:
internal static void EmitRejitCheck(ArmEmitterContext context, out Counter<uint> counter)
{
    const int MinsCallForRejit = 100;
    
    counter = new Counter<uint>(context.CountTable);
    
    Operand lblEnd = Label();
    Operand address = Const(ref counter.Value);
    Operand curCount = context.Load(OperandType.I32, address);
    Operand count = context.Add(curCount, Const(1));
    context.Store(address, count);
    
    // Enqueue for high-CQ recompilation after 100 calls
    context.BranchIf(lblEnd, curCount, Const(MinsCallForRejit), 
                     Comparison.NotEqual, BasicBlockFrequency.Cold);
    context.Call(typeof(NativeInterface).GetMethod(
                 nameof(NativeInterface.EnqueueForRejit)), 
                 Const(context.EntryAddress));
    
    context.MarkLabel(lblEnd);
}
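The tiering policy itself is simple enough to model directly: bump a per-function counter on every call, and enqueue the function for background recompilation exactly once when the counter crosses the threshold. A sketch (the `ToyTieringPolicy` type is illustrative, not Ryujinx code):

```csharp
using System.Collections.Generic;

// Toy model of two-tier dispatch: crossing the call-count threshold
// enqueues the function for high-CQ recompilation exactly once.
public class ToyTieringPolicy
{
    public const int RejitThreshold = 100;

    private readonly Dictionary<ulong, int> _callCounts = new();
    public Queue<ulong> RejitQueue { get; } = new();

    // Returns true only on the call that triggers rejit.
    public bool OnCall(ulong address)
    {
        _callCounts.TryGetValue(address, out int count);
        _callCounts[address] = count + 1;
        if (count + 1 == RejitThreshold)
        {
            RejitQueue.Enqueue(address);
            return true;
        }
        return false;
    }
}
```

Testing for equality with the threshold (rather than >=) is what keeps a hot function from being enqueued repeatedly while its high-CQ version compiles in the background.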

Hardware Capabilities Detection

ARMeilleure detects and utilizes host CPU features:
// From src/ARMeilleure/Optimizations.cs
public static class Optimizations
{
    // X86 SIMD extensions
    public static bool UseSseIfAvailable { get; set; } = true;
    public static bool UseAvxIfAvailable { get; set; } = true;
    public static bool UseAvx512FIfAvailable { get; set; } = true;
    
    // Crypto acceleration
    public static bool UseAesniIfAvailable { get; set; } = true;
    public static bool UseShaIfAvailable { get; set; } = true;
    
    // ARM extensions
    public static bool UseAdvSimdIfAvailable { get; set; } = true;
    public static bool UseArm64AesIfAvailable { get; set; } = true;
    
    // Runtime capability check
    internal static bool UseAvx512F => 
        UseAvx512FIfAvailable && X86HardwareCapabilities.SupportsAvx512F;
}
Performance impact: Using AVX-512 can provide 2-4x speedup for vector operations compared to SSE2
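The "opt-in flag && hardware support" pattern from Optimizations.cs can be reproduced with .NET's own intrinsics capability API (`System.Runtime.Intrinsics`, available on .NET 5+). The `HostCapabilities` class and its property names below are illustrative; the `IsSupported` checks are the real .NET API.

```csharp
using System.Runtime.Intrinsics.X86;
using AdvSimd = System.Runtime.Intrinsics.Arm.AdvSimd;

// A feature is used only when both the user opt-in and the host CPU agree,
// mirroring the UseAvx512F property shown above.
public static class HostCapabilities
{
    public static bool UseAvxIfAvailable { get; set; } = true;
    public static bool UseAesniIfAvailable { get; set; } = true;
    public static bool UseAdvSimdIfAvailable { get; set; } = true;

    public static bool UseAvx => UseAvxIfAvailable && Avx.IsSupported;
    public static bool UseAesni => UseAesniIfAvailable && Aes.IsSupported;
    public static bool UseAdvSimd => UseAdvSimdIfAvailable && AdvSimd.IsSupported;
}
```

On a non-x86 host, `Avx.IsSupported` simply returns false, so the same code runs everywhere and the JIT folds the constant checks away.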

Function Cache Management

Translation Cache

// From Translator.cs
internal TranslatorCache<TranslatedFunction> Functions { get; }
internal IAddressTable<ulong> FunctionTable { get; }

internal TranslatedFunction GetOrTranslate(ulong address, ExecutionMode mode)
{
    if (!Functions.TryGetValue(address, out TranslatedFunction func))
    {
        func = Translate(address, mode, highCq: false);
        TranslatedFunction oldFunc = Functions.GetOrAdd(address, func.GuestSize, func);
        
        if (oldFunc != func)
        {
            JitCache.Unmap(func.FuncPointer);  // Race condition, discard
            func = oldFunc;
        }
        
        RegisterFunction(address, func);
    }
    return func;
}
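The race handling in `GetOrTranslate` is the standard first-wins pattern: both threads translate, one insertion wins, and the loser discards its copy. The same semantics can be shown with `ConcurrentDictionary.GetOrAdd` (a sketch; the real code uses `TranslatorCache` and unmaps the losing function's JIT memory):

```csharp
using System.Collections.Concurrent;

// First-wins insertion: if two threads translate the same address
// concurrently, only one translation is kept.
public static class ToyFunctionCache
{
    private static readonly ConcurrentDictionary<ulong, string> _functions = new();

    public static string GetOrTranslate(ulong address, string freshlyTranslated)
    {
        string winner = _functions.GetOrAdd(address, freshlyTranslated);
        if (!ReferenceEquals(winner, freshlyTranslated))
        {
            // Another thread inserted first; our translation is redundant.
            // (The real code calls JitCache.Unmap on the losing function here.)
        }
        return winner;
    }
}
```

Translating twice and discarding one result is cheaper than holding a lock across the whole translation, which can take milliseconds.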

JIT Cache Invalidation

When guest code is modified (self-modifying code, JIT compilers):
public void InvalidateJitCacheRegion(ulong address, ulong size)
{
    ulong[] overlapAddresses = [];
    int overlapsCount = Functions.GetOverlaps(address, size, ref overlapAddresses);
    
    if (overlapsCount != 0)
    {
        ClearRejitQueue(allowRequeue: true);  // Stop background compilation
    }
    
    for (int index = 0; index < overlapsCount; index++)
    {
        ulong overlapAddress = overlapAddresses[index];
        if (Functions.TryGetValue(overlapAddress, out TranslatedFunction overlap))
        {
            Functions.Remove(overlapAddress);
            Volatile.Write(ref FunctionTable.GetValue(overlapAddress), FunctionTable.Fill);
            EnqueueForDeletion(overlapAddress, overlap);
        }
    }
}
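The query behind `Functions.GetOverlaps` is a range-intersection test: a cached function occupying `[Address, Address + Size)` must be invalidated if that range intersects the dirty region. A linear-scan sketch (the real `TranslatorCache` indexes functions for faster lookup):

```csharp
using System.Collections.Generic;

// Toy overlap query over half-open [Address, Address + Size) ranges.
public static class ToyOverlap
{
    public static List<ulong> GetOverlaps(
        IEnumerable<(ulong Address, ulong Size)> functions,
        ulong address, ulong size)
    {
        var result = new List<ulong>();
        foreach (var f in functions)
        {
            // Two half-open ranges intersect iff each starts before the
            // other one ends.
            if (f.Address < address + size && address < f.Address + f.Size)
            {
                result.Add(f.Address);
            }
        }
        return result;
    }
}
```

Note that a function is invalidated even when only part of its guest code range was written, since any byte change can alter decoded instructions.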

PPTC (Profiled Persistent Translation Cache)

ARMeilleure can save and load compiled code across sessions:
1. Profile collection: during initial gameplay, track which functions are executed frequently and compile them to high-CQ.

2. Cache generation: serialize compiled functions to disk with:
  • Function address and hash
  • IR representation
  • Compilation metadata

3. Cache loading on subsequent launches:
_ptc.Initialize(titleIdText, displayVersion, enabled, Memory.Type, cacheSelector);
_ptc.LoadTranslations(this);
_ptc.MakeAndSaveTranslations(this);

4. Validation: verify cached functions against the current guest code using hash comparison.

PPTC reduces startup stutter significantly but requires disk space (typically 50-200 MB per game).
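The validation step amounts to: hash the current guest code bytes and compare with the hash stored alongside the cached translation. A sketch, with SHA-256 standing in for whatever hash the real PPTC format uses, and `ToyCacheValidation` being an illustrative name:

```csharp
using System;
using System.Security.Cryptography;

// Sketch of PPTC-style validation: a cached translation is reused only when
// the guest code it was compiled from is byte-identical.
public static class ToyCacheValidation
{
    public static byte[] HashGuestCode(byte[] guestCode) =>
        SHA256.HashData(guestCode);

    public static bool IsCacheEntryValid(byte[] cachedHash, byte[] currentGuestCode) =>
        cachedHash.AsSpan().SequenceEqual(HashGuestCode(currentGuestCode));
}
```

A stale entry (game update, mod, different memory layout) fails the comparison and simply gets retranslated, so validation errs on the side of correctness rather than reuse.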

Dispatch Mechanisms

Managed Dispatch Loop

// Simple C# loop for debugging
do
{
    address = ExecuteSingle(context, address);
}
while (context.Running && address != 0);

private ulong ExecuteSingle(State.ExecutionContext context, ulong address)
{
    TranslatedFunction func = GetOrTranslate(address, context.ExecutionMode);
    return func.Execute(Stubs.ContextWrapper, context);
}

Unmanaged Dispatch Loop

// High-performance native dispatch
if (Optimizations.UseUnmanagedDispatchLoop)
{
    Stubs.DispatchLoop(context.NativeContextPtr, address);
}
Benefits:
  • Eliminates managed/native transitions
  • Direct function table lookups
  • Lower overhead for function calls
  • 5-15% performance improvement

Performance Characteristics

Startup

Low-CQ compilation:
  • ~0.1-0.5ms per function
  • Minimal stuttering
  • Gradual warmup

Runtime

Execution speed:
  • 70-90% of native ARM hardware (x86 host)
  • 95-100% of native (ARM host)
  • High-CQ provides 20-40% speedup over low-CQ

Memory

Cache usage:
  • ~2-10 KB per compiled function
  • Function table: 8 bytes per 4KB page
  • Total: 50-500 MB per game

Debugging Support

ARMeilleure includes integrated debugging capabilities:
if (Optimizations.EnableDebugging)
{
    context.DebugPc = address;
    do
    {
        if (Interlocked.CompareExchange(ref context.ShouldStep, 0, 1) == 1)
        {
            context.DebugPc = Step(context, context.DebugPc);
            context.StepHandler();
        }
        else
        {
            context.DebugPc = ExecuteSingle(context, context.DebugPc);
        }
        context.CheckInterrupt();
    }
    while (context.Running && context.DebugPc != 0);
}
Features:
  • Single-step execution
  • Precise PC tracking
  • GDB stub integration (see Debugging)
  • Breakpoint support

Related Topics

  • Memory Management: how ARMeilleure interfaces with guest memory
  • HLE Services: how translated code calls into HLE services
  • Graphics Integration: GPU command submission from translated code
  • Performance Tuning: optimization settings for ARMeilleure

Source Code Reference

Key files to explore:
  • src/ARMeilleure/Translation/Translator.cs:22 - Main translator entry point
  • src/ARMeilleure/Translation/Compiler.cs:12 - Optimization pipeline
  • src/ARMeilleure/Decoders/Decoder.cs:11 - Instruction decoder
  • src/ARMeilleure/IntermediateRepresentation/Operation.cs:7 - IR operation structure
  • src/ARMeilleure/CodeGen/X86/CodeGenerator.cs:17 - x86-64 code generation
  • src/ARMeilleure/CodeGen/Arm64/CodeGenerator.cs - ARM64 code generation
