Overview

ARMeilleure is Ryujinx’s custom-built JIT (Just-In-Time) compiler for ARM CPU emulation. It translates ARM64 (and ARM32) guest code into optimized native x86-64 or ARM64 host code at runtime, providing high-performance CPU emulation.
ARMeilleure uses a multi-stage translation pipeline: Decode → IR Translation → Optimization → Register Allocation → Code Generation

Translation Pipeline

Stage 1: Decoding

The decoder (src/ARMeilleure/Decoders/Decoder.cs) performs recursive guest code analysis:
public static Block[] Decode(IMemoryManager memory, ulong address, 
                             ExecutionMode mode, bool highCq, DecoderMode dMode)
{
    List<Block> blocks = [];
    Queue<Block> workQueue = new();
    Dictionary<ulong, Block> visited = new();
    
    int instructionLimit = highCq ? MaxInstsPerFunction : MaxInstsPerFunctionLowCq;
    // Decode blocks recursively...
}
Key features:
  • Basic block construction: Follows control flow (branches, calls, returns)
  • Function size limits: 2500 instructions (high-CQ) or 500 (low-CQ) to prevent excessive compilation time
  • Multi-block analysis: Handles complex control flow graphs
  • Lazy decoding: Only decodes when execution reaches new code regions
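The work-queue discovery of basic blocks can be sketched in miniature. This is a simplified model over a toy instruction set, not ARMeilleure's decoder: `ToyDecoder`, the `int[]{opcode, branchTarget}` encoding, and the opcode values are all illustrative.

```csharp
using System.Collections.Generic;

// Toy model of work-queue basic-block discovery: each "instruction" is an
// int array { opcode, branchTarget }. Opcode 0 = fall-through, 1 = branch.
public static class ToyDecoder
{
    // Returns the start addresses of every basic block reachable from 'entry',
    // stopping once the instruction limit is hit (as the real decoder does).
    public static List<int> Decode(int[][] code, int entry, int instructionLimit)
    {
        var visited = new HashSet<int>();
        var workQueue = new Queue<int>();
        workQueue.Enqueue(entry);
        int decoded = 0;

        while (workQueue.Count > 0 && decoded < instructionLimit)
        {
            int pc = workQueue.Dequeue();
            if (!visited.Add(pc))
                continue; // Block already decoded.

            // Decode sequentially until a branch terminates the block.
            while (pc < code.Length && decoded < instructionLimit)
            {
                decoded++;
                if (code[pc][0] == 1) // Branch: block ends here.
                {
                    workQueue.Enqueue(code[pc][1]); // Taken path.
                    workQueue.Enqueue(pc + 1);      // Fall-through path.
                    break;
                }
                pc++;
            }
        }

        var blocks = new List<int>(visited);
        blocks.Sort();
        return blocks;
    }
}
```

The instruction limit bounds work per function exactly as the `highCq` limits do in the real decoder.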

Stage 2: IR Translation

Guest instructions are lifted into ARMeilleure’s intermediate representation:
// Operation structure from src/ARMeilleure/IntermediateRepresentation/Operation.cs
internal struct Operation
{
    internal struct Data
    {
        public ushort Instruction;
        public ushort Intrinsic;
        public ushort SourcesCount;
        public ushort DestinationsCount;
        public Operation ListPrevious;
        public Operation ListNext;
        public Operand* Destinations;
        public Operand* Sources;
    }
}
IR characteristics:
  • SSA form support: Static Single Assignment for optimization passes
  • Intrusive linked list: Efficient operation manipulation without allocations
  • Typed operands: I32, I64, FP32, FP64, V128 (SIMD vector)
  • Intrinsics: Hardware-accelerated operations (SIMD, crypto, etc.)
// ARM instruction: ADD X0, X1, X2
// Translates to IR (simplified from the AArch64 ALU emitters):
Operand src1 = GetIntOrZR(context, op.Rn);   // Read X1
Operand src2 = GetIntOrZR(context, op.Rm);   // Read X2
Operand result = context.Add(src1, src2);    // Emit an Add operation
SetIntOrZR(context, op.Rd, result);          // Write X0
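The "intrusive linked list" bullet above is worth making concrete: each node embeds its own prev/next links (as `Operation.Data` does with `ListPrevious`/`ListNext`), so inserting or removing an operation is a few pointer writes with no container-node allocation. This is an illustrative sketch, not the real `Operation` type:

```csharp
// Minimal intrusive doubly-linked list in the spirit of ARMeilleure's IR:
// the node itself carries its links, so list edits never allocate.
public class IntrusiveNode
{
    public string Name;
    public IntrusiveNode Prev;
    public IntrusiveNode Next;

    public IntrusiveNode(string name) => Name = name;

    // Insert 'node' immediately after this one: four pointer writes, O(1).
    public void InsertAfter(IntrusiveNode node)
    {
        node.Prev = this;
        node.Next = Next;
        if (Next != null) Next.Prev = node;
        Next = node;
    }

    // Unlink this node from its list in O(1).
    public void Remove()
    {
        if (Prev != null) Prev.Next = Next;
        if (Next != null) Next.Prev = Prev;
        Prev = Next = null;
    }
}
```

Optimization passes that delete or reorder thousands of operations benefit directly from this layout.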

Stage 3: Optimization Passes

The compiler (src/ARMeilleure/Translation/Compiler.cs) applies optimization passes:
public static CompiledFunction Compile(ControlFlowGraph cfg, 
                                       OperandType[] argTypes,
                                       OperandType retType,
                                       CompilerOptions options,
                                       Architecture target)
{
    CompilerContext cctx = new(cfg, argTypes, retType, options);
    
    if (options.HasFlag(CompilerOptions.Optimize))
    {
        TailMerge.RunPass(cctx);  // Merge duplicate block tails
    }
    
    if (options.HasFlag(CompilerOptions.SsaForm))
    {
        Dominance.FindDominators(cfg);
        Dominance.FindDominanceFrontiers(cfg);
        Ssa.Construct(cfg);  // Convert to SSA form
    }
    
    // Backend-specific code generation
    if (target == Architecture.X64)
    {
        return CodeGen.X86.CodeGenerator.Generate(cctx);
    }
    else if (target == Architecture.Arm64)
    {
        return CodeGen.Arm64.CodeGenerator.Generate(cctx);
    }
    else
    {
        throw new NotImplementedException(target.ToString());
    }
}
Optimization techniques:

SSA construction: converts the IR to Static Single Assignment form for advanced optimizations:
  • Phi node insertion at control flow merge points
  • Def-use chain tracking
  • Enables constant propagation and dead code elimination

Constant folding: evaluates constant expressions at compile time:
// Before: ADD r0, #5, #3
// After:  MOV r0, #8
Implemented in CodeGen/Optimizations/ConstantFolding.cs

Tail merging: merges duplicate code at the end of basic blocks to reduce code size and improve instruction cache efficiency.

Block placement: reorders basic blocks for:
  • Better branch prediction (hot paths fall through)
  • Improved instruction cache locality
  • Reduced branch penalties
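Constant folding of the kind shown in the before/after example can be sketched over a tiny expression tree. The `Expr` type here is hypothetical, far simpler than ARMeilleure's IR, but the pass has the same shape: collapse any operation whose operands are all constants.

```csharp
// Tiny constant folder: an expression is either a constant leaf or a binary op.
public class Expr
{
    public char Op;          // '\0' for a constant leaf, else '+', '-', '*'.
    public long Value;       // Meaningful only when Op == '\0'.
    public Expr Left, Right;

    public static Expr Const(long v) => new Expr { Op = '\0', Value = v };
    public static Expr Bin(char op, Expr l, Expr r) => new Expr { Op = op, Left = l, Right = r };

    public bool IsConst => Op == '\0';

    // Bottom-up fold: children first, then this node if both sides are constant.
    public Expr Fold()
    {
        if (IsConst) return this;
        Expr l = Left.Fold(), r = Right.Fold();
        if (l.IsConst && r.IsConst)
        {
            long v = Op switch
            {
                '+' => l.Value + r.Value,
                '-' => l.Value - r.Value,
                '*' => l.Value * r.Value,
                _ => throw new System.InvalidOperationException(),
            };
            return Const(v); // e.g. (5 + 3) folds to 8, like ADD #5, #3 -> MOV #8.
        }
        return Bin(Op, l, r);
    }
}
```

SSA form makes this more powerful in practice, since each value has exactly one definition to inspect.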

Stage 4: Register Allocation

Two allocation strategies, chosen by compilation tier: a fast hybrid allocator (RegisterAllocators/HybridAllocator.cs) for quick low-CQ compilation, and linear scan for optimized high-CQ builds:
// LinearScanAllocator from RegisterAllocators/LinearScanAllocator.cs
// - Live interval computation
// - Single pass over intervals ordered by start position
// - Greedy register assignment
// - Stack spilling when registers are exhausted
Linear scan characteristics:
  • Near-linear complexity
  • Good allocation quality at modest compilation cost
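A toy linear scan allocator shows the core idea: walk live intervals in order of start position, free registers whose intervals have expired, and spill when none remain. This is a sketch, not ARMeilleure's allocator; intervals are plain `(start, end)` pairs.

```csharp
using System.Collections.Generic;
using System.Linq;

// Toy linear-scan register allocation over (start, end) live intervals.
public static class ToyLinearScan
{
    // Returns one assignment per interval: a register index, or -1 for a spill.
    public static int[] Allocate((int Start, int End)[] intervals, int numRegs)
    {
        int[] result = new int[intervals.Length];
        var free = new Stack<int>(Enumerable.Range(0, numRegs).Reverse());
        // Currently live intervals as (end position, assigned register).
        var active = new List<(int End, int Reg)>();

        foreach (int i in Enumerable.Range(0, intervals.Length)
                                    .OrderBy(i => intervals[i].Start))
        {
            // Expire intervals that ended before this one starts.
            foreach (var a in active.Where(a => a.End < intervals[i].Start).ToList())
            {
                free.Push(a.Reg);
                active.Remove(a);
            }
            if (free.Count == 0)
            {
                result[i] = -1; // No register left: spill to the stack.
            }
            else
            {
                result[i] = free.Pop();
                active.Add((intervals[i].End, result[i]));
            }
        }
        return result;
    }
}
```

The single ordered pass is what keeps compilation overhead low compared to graph-coloring approaches.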

Stage 5: Code Generation

Native machine code generation for host architecture:
// From CodeGen/X86/CodeGenerator.cs (simplified)
public static CompiledFunction Generate(CompilerContext cctx)
{
    ControlFlowGraph cfg = cctx.Cfg;
    
    // ... register allocation and CodeGenContext setup elided ...
    
    // Instruction table maps IR operations to code generators
    foreach (BasicBlock block in cfg.Blocks)
    {
        foreach (Operation operation in block.Operations)
        {
            Action<CodeGenContext, Operation> generator = 
                _instTable[(int)operation.Instruction];
            generator(context, operation);
        }
    }
    
    // Map executable memory and return a callable function pointer
    return compiledFunc.MapWithPointer<GuestFunction>(out nint funcPointer);
}
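The `_instTable` lookup is a classic table-driven dispatch: an array of emitter delegates indexed by opcode. A self-contained sketch (the opcodes, operand format, and string output here are illustrative, standing in for real machine-code emission):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Sketch of table-driven code generation: each IR opcode indexes an array
// of emitter delegates that append "host code" (text, for illustration).
public static class ToyCodeGen
{
    public enum Inst { Add = 0, Sub = 1, Mov = 2 }

    private static readonly Action<StringBuilder, string[]>[] _instTable =
    {
        (asm, ops) => asm.AppendLine($"add {ops[0]}, {ops[1]}"),
        (asm, ops) => asm.AppendLine($"sub {ops[0]}, {ops[1]}"),
        (asm, ops) => asm.AppendLine($"mov {ops[0]}, {ops[1]}"),
    };

    public static string Generate(IEnumerable<(Inst Inst, string[] Operands)> ops)
    {
        var asm = new StringBuilder();
        foreach (var op in ops)
        {
            // Same shape as ARMeilleure's dispatch: look up the generator
            // for this opcode and let it emit the host instruction(s).
            _instTable[(int)op.Inst](asm, op.Operands);
        }
        return asm.ToString();
    }
}
```

Keeping emitters in a flat array makes dispatch a single indexed load per IR operation.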
Backend features:

x86-64 Backend

  • SSE/AVX/AVX-512 SIMD support
  • Hardware AES/SHA acceleration
  • Optimized calling conventions (System V / Windows x64)
  • Efficient stack frame management

ARM64 Backend

  • Native ARM64 code on Apple Silicon / Linux ARM
  • NEON SIMD instructions
  • ARM crypto extensions
  • Minimal translation overhead when guest and host are both ARM64 (no instruction-set conversion needed)

Two-Tier Compilation

ARMeilleure uses adaptive compilation to balance startup time and performance:

Low-CQ (Low Code Quality)

// Fast compilation path
TranslatedFunction func = Translate(address, mode, highCq: false);
// - Minimal optimizations
// - Linear scan register allocation
// - Fast startup
// - Lower runtime performance

High-CQ (High Code Quality)

// Optimization compilation path (background threads)
if (callCount >= 100)
{
    TranslatedFunction func = Translate(address, mode, highCq: true);
    // - Full optimization passes
    // - Advanced register allocation
    // - Slower compilation
    // - Maximum runtime performance
}
Rejit mechanism from src/ARMeilleure/Translation/Translator.cs:479:
internal static void EmitRejitCheck(ArmEmitterContext context, out Counter<uint> counter)
{
    const int MinsCallForRejit = 100;
    
    counter = new Counter<uint>(context.CountTable);
    
    Operand lblEnd = Label();
    Operand address = Const(ref counter.Value);
    Operand curCount = context.Load(OperandType.I32, address);
    Operand count = context.Add(curCount, Const(1));
    context.Store(address, count);
    
    // Enqueue for high-CQ recompilation after 100 calls
    context.BranchIf(lblEnd, curCount, Const(MinsCallForRejit), 
                     Comparison.NotEqual, BasicBlockFrequency.Cold);
    context.Call(typeof(NativeInterface).GetMethod(
                 nameof(NativeInterface.EnqueueForRejit)), 
                 Const(context.EntryAddress));
    
    context.MarkLabel(lblEnd);
}
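The tiering policy itself is simple enough to model directly: bump a per-function counter on every call, and enqueue the function for background recompilation exactly once when the counter crosses the threshold. A sketch (the `ToyTieringPolicy` type is illustrative, not Ryujinx code):

```csharp
using System.Collections.Generic;

// Toy model of two-tier dispatch: crossing the call-count threshold
// enqueues the function for high-CQ recompilation exactly once.
public class ToyTieringPolicy
{
    public const int RejitThreshold = 100;

    private readonly Dictionary<ulong, int> _callCounts = new();
    public Queue<ulong> RejitQueue { get; } = new();

    // Returns true only on the call that triggers rejit.
    public bool OnCall(ulong address)
    {
        _callCounts.TryGetValue(address, out int count);
        _callCounts[address] = count + 1;
        if (count + 1 == RejitThreshold)
        {
            RejitQueue.Enqueue(address);
            return true;
        }
        return false;
    }
}
```

Testing for equality with the threshold (rather than >=) is what keeps a hot function from being enqueued repeatedly while its high-CQ version compiles in the background.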

Hardware Capabilities Detection

ARMeilleure detects and utilizes host CPU features:
// From src/ARMeilleure/Optimizations.cs
public static class Optimizations
{
    // X86 SIMD extensions
    public static bool UseSseIfAvailable { get; set; } = true;
    public static bool UseAvxIfAvailable { get; set; } = true;
    public static bool UseAvx512FIfAvailable { get; set; } = true;
    
    // Crypto acceleration
    public static bool UseAesniIfAvailable { get; set; } = true;
    public static bool UseShaIfAvailable { get; set; } = true;
    
    // ARM extensions
    public static bool UseAdvSimdIfAvailable { get; set; } = true;
    public static bool UseArm64AesIfAvailable { get; set; } = true;
    
    // Runtime capability check
    internal static bool UseAvx512F => 
        UseAvx512FIfAvailable && X86HardwareCapabilities.SupportsAvx512F;
}
Performance impact: Using AVX-512 can provide 2-4x speedup for vector operations compared to SSE2
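The "opt-in flag && hardware support" pattern from Optimizations.cs can be reproduced with .NET's own intrinsics capability API (`System.Runtime.Intrinsics`, available on .NET 5+). The `HostCapabilities` class and its property names below are illustrative; the `IsSupported` checks are the real .NET API.

```csharp
using System.Runtime.Intrinsics.X86;
using AdvSimd = System.Runtime.Intrinsics.Arm.AdvSimd;

// A feature is used only when both the user opt-in and the host CPU agree,
// mirroring the UseAvx512F property shown above.
public static class HostCapabilities
{
    public static bool UseAvxIfAvailable { get; set; } = true;
    public static bool UseAesniIfAvailable { get; set; } = true;
    public static bool UseAdvSimdIfAvailable { get; set; } = true;

    public static bool UseAvx => UseAvxIfAvailable && Avx.IsSupported;
    public static bool UseAesni => UseAesniIfAvailable && Aes.IsSupported;
    public static bool UseAdvSimd => UseAdvSimdIfAvailable && AdvSimd.IsSupported;
}
```

On a non-x86 host, `Avx.IsSupported` simply returns false, so the same code runs everywhere and the JIT folds the constant checks away.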

Function Cache Management

Translation Cache

// From Translator.cs
internal TranslatorCache<TranslatedFunction> Functions { get; }
internal IAddressTable<ulong> FunctionTable { get; }

internal TranslatedFunction GetOrTranslate(ulong address, ExecutionMode mode)
{
    if (!Functions.TryGetValue(address, out TranslatedFunction func))
    {
        func = Translate(address, mode, highCq: false);
        TranslatedFunction oldFunc = Functions.GetOrAdd(address, func.GuestSize, func);
        
        if (oldFunc != func)
        {
            JitCache.Unmap(func.FuncPointer);  // Race condition, discard
            func = oldFunc;
        }
        
        RegisterFunction(address, func);
    }
    return func;
}
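The race handling in `GetOrTranslate` is the standard first-wins pattern: both threads translate, one insertion wins, and the loser discards its copy. The same semantics can be shown with `ConcurrentDictionary.GetOrAdd` (a sketch; the real code uses `TranslatorCache` and unmaps the losing function's JIT memory):

```csharp
using System.Collections.Concurrent;

// First-wins insertion: if two threads translate the same address
// concurrently, only one translation is kept.
public static class ToyFunctionCache
{
    private static readonly ConcurrentDictionary<ulong, string> _functions = new();

    public static string GetOrTranslate(ulong address, string freshlyTranslated)
    {
        string winner = _functions.GetOrAdd(address, freshlyTranslated);
        if (!ReferenceEquals(winner, freshlyTranslated))
        {
            // Another thread inserted first; our translation is redundant.
            // (The real code calls JitCache.Unmap on the losing function here.)
        }
        return winner;
    }
}
```

Translating twice and discarding one result is cheaper than holding a lock across the whole translation, which can take milliseconds.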

JIT Cache Invalidation

When guest code is modified (self-modifying code, JIT compilers):
public void InvalidateJitCacheRegion(ulong address, ulong size)
{
    ulong[] overlapAddresses = [];
    int overlapsCount = Functions.GetOverlaps(address, size, ref overlapAddresses);
    
    if (overlapsCount != 0)
    {
        ClearRejitQueue(allowRequeue: true);  // Stop background compilation
    }
    
    for (int index = 0; index < overlapsCount; index++)
    {
        ulong overlapAddress = overlapAddresses[index];
        if (Functions.TryGetValue(overlapAddress, out TranslatedFunction overlap))
        {
            Functions.Remove(overlapAddress);
            Volatile.Write(ref FunctionTable.GetValue(overlapAddress), FunctionTable.Fill);
            EnqueueForDeletion(overlapAddress, overlap);
        }
    }
}
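The query behind `Functions.GetOverlaps` is a range-intersection test: a cached function occupying `[Address, Address + Size)` must be invalidated if that range intersects the dirty region. A linear-scan sketch (the real `TranslatorCache` indexes functions for faster lookup):

```csharp
using System.Collections.Generic;

// Toy overlap query over half-open [Address, Address + Size) ranges.
public static class ToyOverlap
{
    public static List<ulong> GetOverlaps(
        IEnumerable<(ulong Address, ulong Size)> functions,
        ulong address, ulong size)
    {
        var result = new List<ulong>();
        foreach (var f in functions)
        {
            // Two half-open ranges intersect iff each starts before the
            // other one ends.
            if (f.Address < address + size && address < f.Address + f.Size)
            {
                result.Add(f.Address);
            }
        }
        return result;
    }
}
```

Note that a function is invalidated even when only part of its guest code range was written, since any byte change can alter decoded instructions.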

PPTC (Profiled Persistent Translation Cache)

ARMeilleure can save and load compiled code across sessions:
1. Profile collection: during initial gameplay, track which functions are executed frequently and compile them to high-CQ.

2. Cache generation: serialize compiled functions to disk with:
  • Function address and hash
  • IR representation
  • Compilation metadata

3. Cache loading on subsequent launches:
_ptc.Initialize(titleIdText, displayVersion, enabled, Memory.Type, cacheSelector);
_ptc.LoadTranslations(this);
_ptc.MakeAndSaveTranslations(this);

4. Validation: verify cached functions against the current guest code using hash comparison.

PPTC reduces startup stutter significantly but requires disk space (typically 50-200 MB per game).
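The validation step amounts to: hash the current guest code bytes and compare with the hash stored alongside the cached translation. A sketch, with SHA-256 standing in for whatever hash the real PPTC format uses, and `ToyCacheValidation` being an illustrative name:

```csharp
using System;
using System.Security.Cryptography;

// Sketch of PPTC-style validation: a cached translation is reused only when
// the guest code it was compiled from is byte-identical.
public static class ToyCacheValidation
{
    public static byte[] HashGuestCode(byte[] guestCode) =>
        SHA256.HashData(guestCode);

    public static bool IsCacheEntryValid(byte[] cachedHash, byte[] currentGuestCode) =>
        cachedHash.AsSpan().SequenceEqual(HashGuestCode(currentGuestCode));
}
```

A stale entry (game update, mod, different memory layout) fails the comparison and simply gets retranslated, so validation errs on the side of correctness rather than reuse.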

Dispatch Mechanisms

Managed Dispatch Loop

// Simple C# loop for debugging
do
{
    address = ExecuteSingle(context, address);
}
while (context.Running && address != 0);

private ulong ExecuteSingle(State.ExecutionContext context, ulong address)
{
    TranslatedFunction func = GetOrTranslate(address, context.ExecutionMode);
    return func.Execute(Stubs.ContextWrapper, context);
}

Unmanaged Dispatch Loop

// High-performance native dispatch
if (Optimizations.UseUnmanagedDispatchLoop)
{
    Stubs.DispatchLoop(context.NativeContextPtr, address);
}
Benefits:
  • Eliminates managed/native transitions
  • Direct function table lookups
  • Lower overhead for function calls
  • 5-15% performance improvement

Performance Characteristics

Startup

Low-CQ compilation:
  • ~0.1-0.5ms per function
  • Minimal stuttering
  • Gradual warmup

Runtime

Execution speed:
  • 70-90% of native ARM hardware (x86 host)
  • 95-100% of native (ARM host)
  • High-CQ provides 20-40% speedup over low-CQ

Memory

Cache usage:
  • ~2-10 KB per compiled function
  • Function table: 8 bytes per 4KB page
  • Total: 50-500 MB per game

Debugging Support

ARMeilleure includes integrated debugging capabilities:
if (Optimizations.EnableDebugging)
{
    context.DebugPc = address;
    do
    {
        if (Interlocked.CompareExchange(ref context.ShouldStep, 0, 1) == 1)
        {
            context.DebugPc = Step(context, context.DebugPc);
            context.StepHandler();
        }
        else
        {
            context.DebugPc = ExecuteSingle(context, context.DebugPc);
        }
        context.CheckInterrupt();
    }
    while (context.Running && context.DebugPc != 0);
}
Features:
  • Single-step execution
  • Precise PC tracking
  • GDB stub integration (see Debugging)
  • Breakpoint support

Related Topics

  • Memory Management: how ARMeilleure interfaces with guest memory
  • HLE Services: how translated code calls into HLE services
  • Graphics Integration: GPU command submission from translated code
  • Performance Tuning: optimization settings for ARMeilleure

Source Code Reference

Key files to explore:
  • src/ARMeilleure/Translation/Translator.cs:22 - Main translator entry point
  • src/ARMeilleure/Translation/Compiler.cs:12 - Optimization pipeline
  • src/ARMeilleure/Decoders/Decoder.cs:11 - Instruction decoder
  • src/ARMeilleure/IntermediateRepresentation/Operation.cs:7 - IR operation structure
  • src/ARMeilleure/CodeGen/X86/CodeGenerator.cs:17 - x86-64 code generation
  • src/ARMeilleure/CodeGen/Arm64/CodeGenerator.cs - ARM64 code generation
