Overview

Performance is critical for emulation. This guide covers profiling tools, optimization techniques, and performance analysis for Ryujinx development.
Measure first, optimize second. Profile before changing code so that effort goes to real bottlenecks rather than guesses.

Build for Performance

Release Configuration

# Build optimized release version
dotnet build -c Release

# Publish optimized for specific platform
dotnet publish -c Release -r win-x64

Optimization Flags

From src/ARMeilleure/ARMeilleure.csproj:
<PropertyGroup>
  <AllowUnsafeBlocks>True</AllowUnsafeBlocks>
  <Optimize>True</Optimize>
</PropertyGroup>

JIT Optimizations

Ryujinx uses ARMeilleure (ARM JIT compiler) for CPU emulation:
// From src/ARMeilleure/Optimizations.cs
public static class Optimizations
{
    public static bool AllowLcqInFunctionTable { get; set; } = true;
    public static bool UseUnmanagedDispatchLoop { get; set; } = true;
}
These optimizations are disabled during testing for faster test execution (from src/Ryujinx.Tests/Cpu/CpuTest.cs:65-66).

Profiling Tools

Built-in .NET Profilers

dotnet-counters reports real-time performance metrics for a running .NET process:
# Install
dotnet tool install -g dotnet-counters

# Monitor running process
dotnet-counters monitor --process-id <PID>

# Monitor specific counters
dotnet-counters monitor --process-id <PID> \
  --counters System.Runtime[cpu-usage,working-set,gc-heap-size]

Visual Studio Profiler

1. Start profiling session
   Debug → Performance Profiler, or press Alt+F2.

2. Select profiling tools
     • CPU Usage: Find hot paths
     • .NET Object Allocation: Memory allocations
     • Instrumentation: Detailed timing
     • GPU Usage: Graphics performance

3. Start profiling
   Click Start to launch with profiling.

4. Analyze results
   Review flame graphs, call trees, and hot paths.

JetBrains dotTrace

1. Profile application
   Run → Profile in Rider.

2. Choose profiling mode
     • Sampling: Low overhead, statistical
     • Tracing: Accurate, higher overhead
     • Line-by-line: Most detailed

3. Analyze timeline
   View CPU usage over time and identify spikes.

4. Inspect call tree
   Find the methods consuming the most CPU time.

PerfView (Free, Windows)

# Download from https://github.com/microsoft/perfview

# Collect trace
PerfView.exe collect

# Analyze trace
PerfView.exe <trace-file>.etl

Benchmarking

BenchmarkDotNet

The gold standard for .NET micro-benchmarking:
1. Install package

<PackageReference Include="BenchmarkDotNet" Version="0.13.12" />
2. Create benchmark class

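using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Illustrative only: ShaderCache, GetProgram, and CompileShader stand in
// for whatever code you want to measure.
[MemoryDiagnoser]
public class ShaderCacheBenchmarks
{
    private ShaderCache _cache;
    private byte[] _shaderCode;
    
    [GlobalSetup]
    public void Setup()
    {
        _cache = new ShaderCache();
        _shaderCode = new byte[256]; // Placeholder input for CompileShader
    }
    
    [Benchmark]
    public void GetShaderProgram()
    {
        _cache.GetProgram(0x1234);
    }
    
    [Benchmark]
    public void CompileShader()
    {
        _cache.CompileShader(_shaderCode);
    }
}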
3. Run benchmarks

class Program
{
    static void Main(string[] args)
    {
        BenchmarkRunner.Run<ShaderCacheBenchmarks>();
    }
}

Benchmark Attributes

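// Marks a method to be measured
[Benchmark]
public void MyMethod() { }

// Runs once per benchmark, before its iterations
[GlobalSetup]
public void Setup() { }

// Runs before each iteration
[IterationSetup]
public void IterationSetup() { }

// Adds allocation and GC statistics to the results
[MemoryDiagnoser]
public class MyBenchmarks { }

// Repeats the benchmarks for each listed value
[Params(100, 1000, 10000)]
public int Size { get; set; }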

Simple Performance Measurement

using System.Diagnostics;

// Quick measurement
var sw = Stopwatch.StartNew();
DoWork();
sw.Stop();
Logger.Info?.Print(LogClass.Application, 
    $"Operation took {sw.ElapsedMilliseconds}ms");

// High-resolution timing
long start = Stopwatch.GetTimestamp();
DoWork();
long end = Stopwatch.GetTimestamp();
double elapsedMs = (end - start) * 1000.0 / Stopwatch.Frequency;
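
When the same measurement is needed in several places, the timestamp arithmetic above can be wrapped in a small disposable scope. The following is a minimal sketch, not part of Ryujinx: `TimedScope` and its callback parameter are hypothetical names chosen for illustration.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: measures the lifetime of a using-scope and hands the
// elapsed milliseconds to a caller-supplied callback when disposed.
public readonly struct TimedScope : IDisposable
{
    private readonly long _start;
    private readonly Action<double> _report;

    public TimedScope(Action<double> report)
    {
        _report = report;
        _start = Stopwatch.GetTimestamp();
    }

    public void Dispose()
    {
        long end = Stopwatch.GetTimestamp();
        // Convert timer ticks to milliseconds using the timer frequency.
        _report((end - _start) * 1000.0 / Stopwatch.Frequency);
    }
}
```

A call site would look like `using (new TimedScope(ms => Console.WriteLine($"DoWork took {ms:F3} ms"))) { DoWork(); }`, with the callback swapped for whatever logger the surrounding code uses.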

Optimization Techniques

Memory Allocation Optimization

// Bad: allocates array
byte[] buffer = new byte[1024];
ProcessData(buffer);

// Good: stack allocation
Span<byte> buffer = stackalloc byte[1024];
ProcessData(buffer);
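
Stack allocation only suits small buffers, because stack space is limited. One common pattern, sketched below under assumptions, uses `stackalloc` for small sizes and rents from `ArrayPool<byte>.Shared` otherwise. The names `ScratchBuffer`, `SpanFiller`, and the 1024-byte cutoff are hypothetical and should be adapted to the call site.

```csharp
using System;
using System.Buffers;

// Hypothetical callback type. A dedicated delegate keeps this compiling on
// language versions where Span<T> cannot be used as a generic type argument.
public delegate void SpanFiller(Span<byte> destination);

public static class ScratchBuffer
{
    // Assumed cutoff; tune it for the stack depth of your call sites.
    private const int StackLimit = 1024;

    // Lets `fill` write `length` bytes into a temporary buffer and returns
    // their sum, without allocating a fresh array on each call.
    public static int FillAndSum(int length, SpanFiller fill)
    {
        byte[] rented = null;
        Span<byte> buffer = length <= StackLimit
            ? stackalloc byte[length]
            : (rented = ArrayPool<byte>.Shared.Rent(length)).AsSpan(0, length);
        try
        {
            fill(buffer);
            int sum = 0;
            foreach (byte b in buffer) sum += b;
            return sum;
        }
        finally
        {
            // Rented arrays must go back to the pool, even on exceptions.
            if (rented != null) ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

Rented arrays may be larger than requested and may contain stale data, which is why the sketch slices to `length` and relies on `fill` to overwrite every byte.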

CPU Optimization

using System.Runtime.CompilerServices;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private int FastMethod(int x)
{
    return x * 2;
}

Data Structure Optimization

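using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Use appropriate collections

// Fast lookup by key: O(1) on average
var dict = new Dictionary<int, string>();

// Fast sequential iteration over contiguous storage
var list = new List<int>();

// Thread-safe access from multiple threads
var concurrent = new ConcurrentDictionary<int, string>();

// Allocation-free view over an existing buffer
byte[] buffer = new byte[256];
var span = new Span<byte>(buffer);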

GPU Performance

Shader Compilation

From src/Ryujinx.Graphics.Gpu/Shader/ShaderCache.cs:
// Cache compiled shaders to avoid recompilation
private readonly Dictionary<ulong, CachedShaderProgram> _programCache;

public ShaderProgram GetProgram(ulong address)
{
    if (_programCache.TryGetValue(address, out var cached))
    {
        return cached.Program; // Fast path
    }
    
    // Compile and cache
    var program = CompileShader(address);
    _programCache[address] = new CachedShaderProgram(program);
    return program;
}
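
The same get-or-compile pattern can be made thread-safe and instrumented, so that the cache hit rate is visible while profiling. This is a sketch under assumptions rather than the Ryujinx implementation: `CompileCache`, `Hits`, and `Misses` are hypothetical names, and the compile step is passed in as a delegate.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical generic cache; not the Ryujinx ShaderCache API.
public sealed class CompileCache<TKey, TValue> where TKey : notnull
{
    private readonly ConcurrentDictionary<TKey, Lazy<TValue>> _entries = new();
    private readonly Func<TKey, TValue> _compile;
    private long _hits;
    private long _misses;

    public CompileCache(Func<TKey, TValue> compile) => _compile = compile;

    public long Hits => Interlocked.Read(ref _hits);
    public long Misses => Interlocked.Read(ref _misses);

    public TValue Get(TKey key)
    {
        // Fast path: the entry already exists.
        if (_entries.TryGetValue(key, out Lazy<TValue> existing))
        {
            Interlocked.Increment(ref _hits);
            return existing.Value;
        }

        // Slow path: Lazy ensures the expensive compile runs at most once
        // per key, even if several threads race to add the same entry.
        Lazy<TValue> entry = _entries.GetOrAdd(key, k => new Lazy<TValue>(() =>
        {
            Interlocked.Increment(ref _misses);
            return _compile(k);
        }));
        return entry.Value;
    }
}
```

A low hit count relative to misses during steady-state emulation suggests that the cache key is too specific or that entries are being invalidated too often.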

Texture Caching

  • Reuse texture resources
  • Compress textures when possible
  • Use appropriate texture formats
  • Implement mipmap generation efficiently

Memory Performance

Memory Profiling

1. Take memory snapshot
   Visual Studio: Debug → Memory Usage → Take Snapshot

2. Perform operation
   Execute the code you want to analyze.

3. Take second snapshot
   Compare snapshots to see allocations.

4. Analyze differences
   Identify objects that weren’t garbage collected.
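
For a quick check without a profiler, the before/after idea can be approximated in code with `GC.GetAllocatedBytesForCurrentThread()`. The sketch below uses a hypothetical helper named `AllocationProbe`. It reports how many bytes an operation allocated on the calling thread, not which objects survived collection, so it complements snapshot comparison rather than replacing it.

```csharp
using System;

// Hypothetical helper for spot-checking allocation volume.
public static class AllocationProbe
{
    // Returns the number of bytes allocated on the current thread
    // while `action` runs. Allocations on other threads are not counted.
    public static long Measure(Action action)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        action();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

For example, `AllocationProbe.Measure(() => ProcessFrame())` returning a non-zero value on every frame would point at a per-frame allocation worth removing (`ProcessFrame` is a placeholder for the code under test).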

Common Memory Issues

Symptom: Memory usage grows over time
Causes:
  • Event handlers not unsubscribed
  • Static collections holding references
  • IDisposable not called
Fix:
// Unsubscribe events
obj.Event -= Handler;

// Use weak references
var weakRef = new WeakReference<T>(obj);

// Dispose properly
using var resource = new Resource();
Symptom: High GC pressure, frequent Gen0 collections
Fix:
  • Use object pooling
  • Use Span<T> and stackalloc
  • Reuse buffers
Symptom: Memory usage higher than expected
Fix:
  • Avoid allocating objects of 85,000 bytes or more, which are placed on the Large Object Heap
  • Use array pooling
  • Use GC.TryStartNoGCRegion(totalSize) for short critical sections
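
Object pooling, listed among the fixes above, can be sketched with a `ConcurrentBag<T>`. This is a minimal illustration with hypothetical names (`SimplePool`, `maxRetained`). Callers are assumed to reset an object's state before returning it.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical minimal object pool.
public sealed class SimplePool<T> where T : class
{
    private readonly ConcurrentBag<T> _items = new();
    private readonly Func<T> _factory;
    private readonly int _maxRetained;

    public SimplePool(Func<T> factory, int maxRetained = 32)
    {
        _factory = factory;
        _maxRetained = maxRetained;
    }

    // Reuses a pooled instance when one is available, otherwise creates one.
    public T Rent() => _items.TryTake(out T item) ? item : _factory();

    public void Return(T item)
    {
        // Approximate cap: Count can be slightly stale under contention,
        // so the pool may briefly hold a few more items than the limit.
        if (_items.Count < _maxRetained)
        {
            _items.Add(item);
        }
    }
}
```

Pooling pays off for objects that are expensive to construct or allocated at a high rate. For short-lived small objects the generational GC is often cheap enough, so measure before and after.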

Concurrency and Threading

Parallel Processing

using System.Linq;
using System.Threading.Tasks;

// Parallel loops
Parallel.For(0, count, i =>
{
    ProcessItem(i);
});

// Parallel LINQ
var results = items.AsParallel()
    .Where(x => x.IsValid)
    .Select(x => Transform(x))
    .ToList();

// Task-based parallelism
var tasks = new Task[10];
for (int i = 0; i < 10; i++)
{
    int index = i;
    tasks[i] = Task.Run(() => ProcessItem(index));
}
await Task.WhenAll(tasks);

Lock-Free Programming

using System.Collections.Concurrent;
using System.Threading;

// Interlocked operations
Interlocked.Increment(ref counter);
Interlocked.CompareExchange(ref value, newValue, comparand);

// Concurrent collections
var queue = new ConcurrentQueue<T>();
var dict = new ConcurrentDictionary<K, V>();
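
`Interlocked.CompareExchange` is usually used in a retry loop: read the current value, compute the desired value, and publish it only if nothing changed in between. The sketch below applies that loop to tracking a maximum, for example a peak frame time. `InterlockedEx.Max` is a hypothetical helper, not a framework API.

```csharp
using System.Threading;

// Hypothetical helper built on the CompareExchange retry loop.
public static class InterlockedEx
{
    // Atomically raises `target` to `candidate` if `candidate` is larger.
    // Returns the maximum observed after the operation.
    public static long Max(ref long target, long candidate)
    {
        long current = Volatile.Read(ref target);
        while (candidate > current)
        {
            long previous = Interlocked.CompareExchange(ref target, candidate, current);
            if (previous == current)
            {
                return candidate; // Our write won.
            }
            current = previous;   // Another thread changed it; retry.
        }
        return current;
    }
}
```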

Performance Monitoring

Built-in Performance Counters

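using System;
using System.Diagnostics;
using System.Threading;

// CPU usage (PerformanceCounter is Windows-only)
var cpuCounter = new PerformanceCounter(
    "Processor", "% Processor Time", "_Total");
cpuCounter.NextValue();   // First read of a rate counter returns 0
Thread.Sleep(1000);       // Wait for one sampling interval
float cpuUsage = cpuCounter.NextValue();

// Managed heap size (not total process memory)
long memoryUsage = GC.GetTotalMemory(false);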

Custom Metrics

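using System.Diagnostics;
using System.Threading;

public class PerformanceMetrics
{
    private long _frameCount;
    private readonly Stopwatch _fpsTimer = Stopwatch.StartNew();
    
    // Call once per rendered frame from the render thread;
    // Stopwatch is not safe to restart from several threads.
    public void RecordFrame()
    {
        Interlocked.Increment(ref _frameCount);
        
        double elapsedSeconds = _fpsTimer.Elapsed.TotalSeconds;
        if (elapsedSeconds >= 1.0)
        {
            // Read and reset in one atomic step so no frames are lost
            long frames = Interlocked.Exchange(ref _frameCount, 0);
            _fpsTimer.Restart();
            
            // Divide by the real interval, which may exceed one second
            double fps = frames / elapsedSeconds;
            Logger.Info?.Print(LogClass.Application, $"FPS: {fps:F1}");
        }
    }
}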

Performance Testing

Load Testing

[Test]
public void LoadTest()
{
    const int operations = 1000000;
    var sw = Stopwatch.StartNew();
    
    for (int i = 0; i < operations; i++)
    {
        DoOperation();
    }
    
    sw.Stop();
    double opsPerSecond = operations / sw.Elapsed.TotalSeconds;
    
    TestContext.WriteLine($"Operations/sec: {opsPerSecond:N0}");
    Assert.That(opsPerSecond, Is.GreaterThan(100000));
}

Optimization Checklist

Before submitting an optimization, verify that you have:
  • Profiled to identify actual bottlenecks
  • Measured baseline performance
  • Focused on hot paths (80/20 rule)
  • Tested in Release configuration
  • Considered algorithmic improvements first
  • Avoided premature optimization
  • Benchmarked changes before/after
  • Tested on target hardware
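
For a rough before/after comparison when a full BenchmarkDotNet run is not practical, a warm-up phase followed by several timed samples gives a more stable number than a single run. The helper below is a sketch with hypothetical names (`QuickBench`, `MedianMs`). It does not control for JIT tiering, GC, or CPU frequency scaling the way BenchmarkDotNet does.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for quick, informal comparisons.
public static class QuickBench
{
    // Runs `action` `warmup` times untimed, then `samples` times timed,
    // and returns the median sample in milliseconds. With an even sample
    // count the upper of the two middle values is returned.
    public static double MedianMs(Action action, int warmup = 3, int samples = 11)
    {
        for (int i = 0; i < warmup; i++)
        {
            action();
        }

        var times = new double[samples];
        for (int i = 0; i < samples; i++)
        {
            long start = Stopwatch.GetTimestamp();
            action();
            times[i] = (Stopwatch.GetTimestamp() - start) * 1000.0 / Stopwatch.Frequency;
        }

        Array.Sort(times);
        return times[samples / 2];
    }
}
```

Run it against the old and new code in the same process and Release build, and treat small differences as noise until a proper benchmark confirms them.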

Common Bottlenecks in Emulation

CPU Emulation

  • JIT compilation overhead
  • Instruction decoding
  • Register state management
  • Memory access translation

GPU Emulation

  • Shader compilation/translation
  • Texture uploads/downloads
  • Draw call overhead
  • GPU synchronization

Memory Management

  • Page table lookups
  • Memory mapping/unmapping
  • Cache invalidation
  • GC pressure from allocations

I/O Operations

  • File system access
  • Save state serialization
  • Shader cache persistence
  • Log file writes

Performance Tips

  • Profile on target hardware: Performance characteristics vary significantly between systems
  • Optimize algorithms first: A better algorithm beats micro-optimizations
  • Cache expensive operations: Especially JIT compilation and shader translation
  • Use async/await correctly: Don’t block threads unnecessarily
  • Monitor GC metrics: Excessive GC pauses hurt emulation smoothness
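
GC activity can be sampled from inside the process with `GC.CollectionCount` and `GC.GetTotalMemory`. The sketch below captures two snapshots and reports the difference. `GcSnapshot` and its members are hypothetical names for illustration.

```csharp
using System;

// Hypothetical value type holding a point-in-time view of GC activity.
public readonly struct GcSnapshot
{
    public int Gen0 { get; }
    public int Gen1 { get; }
    public int Gen2 { get; }
    public long HeapBytes { get; }

    private GcSnapshot(int gen0, int gen1, int gen2, long heapBytes)
    {
        Gen0 = gen0;
        Gen1 = gen1;
        Gen2 = gen2;
        HeapBytes = heapBytes;
    }

    // Reads the cumulative collection counts and current managed heap size.
    public static GcSnapshot Capture() => new GcSnapshot(
        GC.CollectionCount(0),
        GC.CollectionCount(1),
        GC.CollectionCount(2),
        GC.GetTotalMemory(false));

    // Describes how many collections happened since `earlier` was captured.
    public string DeltaSince(GcSnapshot earlier) =>
        $"Gen0 +{Gen0 - earlier.Gen0}, Gen1 +{Gen1 - earlier.Gen1}, " +
        $"Gen2 +{Gen2 - earlier.Gen2}, heap {HeapBytes - earlier.HeapBytes:+#;-#;0} bytes";
}
```

Capturing a snapshot once per second alongside the frame counter makes it easier to see whether frame-time spikes line up with Gen2 collections.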

Next Steps

  • Testing: Benchmark your optimizations
  • Debugging: Profile and debug issues
  • Contributing: Submit performance improvements