Overview
Performance is critical for emulation. This guide covers profiling tools, optimization techniques, and performance analysis for Ryujinx development.
Measure first, optimize second. Profile before changing code so that effort goes into measured bottlenecks rather than guesses.
Release Configuration
# Build optimized release version
dotnet build -c Release
# Publish optimized for specific platform
dotnet publish -c Release -r win-x64
Optimization Flags
From src/ARMeilleure/ARMeilleure.csproj:
<PropertyGroup>
  <AllowUnsafeBlocks>True</AllowUnsafeBlocks>
  <Optimize>True</Optimize>
</PropertyGroup>
JIT Optimizations
Ryujinx uses ARMeilleure (ARM JIT compiler) for CPU emulation:
// From src/ARMeilleure/Optimizations.cs
public static class Optimizations
{
    public static bool AllowLcqInFunctionTable { get; set; } = true;
    public static bool UseUnmanagedDispatchLoop { get; set; } = true;
}
These optimizations are disabled during testing for faster test execution (from src/Ryujinx.Tests/Cpu/CpuTest.cs:65-66).
Built-in .NET Profilers
dotnet-counters
dotnet-trace
dotnet-dump
dotnet-counters shows real-time performance metrics:
# Install
dotnet tool install -g dotnet-counters
# Monitor running process
dotnet-counters monitor --process-id <PID>
# Monitor specific counters
dotnet-counters monitor --process-id <PID> \
  --counters "System.Runtime[cpu-usage,working-set,gc-heap-size]"
dotnet-trace collects performance traces:
# Install
dotnet tool install -g dotnet-trace
# Collect trace
dotnet-trace collect --process-id <PID> \
  --providers Microsoft-DotNETCore-SampleProfiler
# Analyze the resulting .nettrace file with PerfView or Visual Studio
dotnet-dump captures and analyzes memory dumps:
# Install
dotnet tool install -g dotnet-dump
# Capture dump to a known file name
dotnet-dump collect --process-id <PID> --output dump.dmp
# Analyze dump
dotnet-dump analyze dump.dmp
Visual Studio Profiler
Start profiling session
Debug → Performance Profiler or Alt+F2
Select profiling tools
CPU Usage: Find hot paths
.NET Object Allocation: Memory allocations
Instrumentation: Detailed timing
GPU Usage: Graphics performance
Start profiling
Click Start to launch with profiling
Analyze results
Review flame graphs, call trees, and hot paths
JetBrains dotTrace
Profile application
Run → Profile in Rider
Choose profiling mode
Sampling: Low overhead, statistical
Tracing: Accurate, higher overhead
Line-by-line: Most detailed
Analyze timeline
View CPU usage over time and identify spikes
Inspect call tree
Find methods consuming most CPU time
PerfView (Free, Windows)
# Download from https://github.com/microsoft/perfview
# Collect trace
PerfView.exe collect
# Analyze trace
PerfView.exe <trace-file>.etl
Benchmarking
BenchmarkDotNet
The gold standard for .NET micro-benchmarking:
Install package
<PackageReference Include="BenchmarkDotNet" Version="0.13.12" />
Create benchmark class
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class ShaderCacheBenchmarks
{
    private ShaderCache _cache;
    private byte[] _shaderCode;

    [GlobalSetup]
    public void Setup()
    {
        _cache = new ShaderCache();
        // Placeholder input (assumed type); substitute real shader code for a meaningful measurement
        _shaderCode = new byte[1024];
    }

    [Benchmark]
    public void GetShaderProgram()
    {
        _cache.GetProgram(0x1234);
    }

    [Benchmark]
    public void CompileShader()
    {
        _cache.CompileShader(_shaderCode);
    }
}
Run benchmarks
class Program
{
    static void Main(string[] args)
    {
        BenchmarkRunner.Run<ShaderCacheBenchmarks>();
    }
}
Benchmark Attributes
[Benchmark] - Mark method to benchmark
[Benchmark]
public void MyMethod() { }
[GlobalSetup] - Run once before all benchmarks
[GlobalSetup]
public void Setup() { }
[IterationSetup] - Run before each iteration
[IterationSetup]
public void IterationSetup() { }
[MemoryDiagnoser] - Track allocations
[MemoryDiagnoser]
public class MyBenchmarks { }
[Params] - Test multiple values
[Params(100, 1000, 10000)]
public int Size { get; set; }
using System.Diagnostics;
// Quick measurement
var sw = Stopwatch.StartNew();
DoWork();
sw.Stop();
Logger.Info?.Print(LogClass.Application,
    $"Operation took {sw.ElapsedMilliseconds} ms");
// High-resolution timing
long start = Stopwatch.GetTimestamp();
DoWork();
long end = Stopwatch.GetTimestamp();
double elapsedMs = (end - start) * 1000.0 / Stopwatch.Frequency;
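A single Stopwatch reading is noisy, so it can help to wrap the timestamp arithmetic above in a small helper that repeats the measurement. The following is a minimal sketch; the `Timing` class and its method names are hypothetical and are not part of Ryujinx.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of Ryujinx.
public static class Timing
{
    // Runs the action once and returns the elapsed wall-clock time in milliseconds.
    public static double MeasureMs(Action action)
    {
        long start = Stopwatch.GetTimestamp();
        action();
        long end = Stopwatch.GetTimestamp();
        return (end - start) * 1000.0 / Stopwatch.Frequency;
    }

    // Runs the action several times and returns the middle sample,
    // which is less sensitive to one-off stalls than a single run.
    public static double MedianMs(Action action, int runs = 5)
    {
        if (runs <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(runs));
        }

        var samples = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            samples[i] = MeasureMs(action);
        }

        Array.Sort(samples);
        return samples[runs / 2];
    }
}
```

A call such as `Timing.MedianMs(DoWork)` gives a rough figure for quick checks. For results you intend to compare or publish, BenchmarkDotNet remains the better tool because it handles warm-up and statistical analysis.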
Optimization Techniques
Memory Allocation Optimization
// Bad: allocates array
byte[] buffer = new byte[1024];
ProcessData(buffer);
// Good: stack allocation
Span<byte> buffer = stackalloc byte[1024];
ProcessData(buffer);
using System.Buffers;
// Rent from pool instead of allocating
// Note: Rent may return an array larger than the requested 1024 bytes
byte[] buffer = ArrayPool<byte>.Shared.Rent(1024);
try
{
    ProcessData(buffer);
}
finally
{
    // Always return the array, even if ProcessData throws
    ArrayPool<byte>.Shared.Return(buffer);
}
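Because a rented array can be longer than requested and may still hold data from a previous user, callers usually slice it to the size they need. The sketch below illustrates that pattern; `PooledBuffers`, `FillBuffer`, and `SumWithPooledBuffer` are hypothetical names chosen for this example.

```csharp
using System;
using System.Buffers;

// Hypothetical delegate: Span<byte> cannot be used as a generic argument
// to Action<T>, so a dedicated delegate type is declared instead.
public delegate void FillBuffer(Span<byte> buffer);

// Hypothetical helper, not part of Ryujinx.
public static class PooledBuffers
{
    // Lets the caller fill a pooled buffer of exactly `length` bytes,
    // then returns the sum of those bytes.
    public static int SumWithPooledBuffer(int length, FillBuffer fill)
    {
        byte[] rented = ArrayPool<byte>.Shared.Rent(length);
        try
        {
            // Slice to the requested length; the rented array may be larger.
            Span<byte> buffer = rented.AsSpan(0, length);

            // Pooled arrays are not zeroed, so clear stale contents first.
            buffer.Clear();
            fill(buffer);

            int sum = 0;
            foreach (byte b in buffer)
            {
                sum += b;
            }

            return sum;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

For instance, `PooledBuffers.SumWithPooledBuffer(4, b => b.Fill(2))` fills four bytes with the value 2 and sums only those four, regardless of how large the rented array is.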
// Bad: boxes value type
object obj = 42;
// Good: use generics
T Value<T>(T value) => value;
// Bad: allocates every iteration
for (int i = 0; i < 1000; i++)
{
    var temp = new StringBuilder();
    temp.Append(i);
}
// Good: allocate once
var temp = new StringBuilder();
for (int i = 0; i < 1000; i++)
{
    temp.Clear();
    temp.Append(i);
}
CPU Optimization
Inline Methods
SIMD Operations
Unsafe Code
Branch Prediction
using System.Runtime.CompilerServices;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private int FastMethod(int x)
{
    return x * 2;
}
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
if (Avx2.IsSupported)
{
    // Use AVX2 SIMD operations; ptr is an int* obtained in an unsafe context
    Vector256<int> vec = Avx2.LoadVector256(ptr);
}
unsafe
{
    fixed (byte* ptr = &buffer[0])
    {
        // Fast pointer operations
        *ptr = 42;
    }
}
// Help the branch predictor: test the most common case first
if (likelyCondition) // Most common path
{
    FastPath();
}
else
{
    SlowPath();
}
Data Structure Optimization
// Choose the collection that matches the access pattern
// Fast lookup: O(1) on average
var dict = new Dictionary<int, string>();
// Fast sequential iteration
var list = new List<int>();
// Thread-safe concurrent access
var concurrent = new ConcurrentDictionary<int, string>();
// Allocation-free view over an existing buffer
var span = new Span<byte>(buffer);
Shader Compilation
From src/Ryujinx.Graphics.Gpu/Shader/ShaderCache.cs:
// Cache compiled shaders to avoid recompilation
private readonly Dictionary<ulong, CachedShaderProgram> _programCache;

public ShaderProgram GetProgram(ulong address)
{
    if (_programCache.TryGetValue(address, out var cached))
    {
        return cached.Program; // Fast path
    }

    // Compile and cache
    var program = CompileShader(address);
    _programCache[address] = new CachedShaderProgram(program);
    return program;
}
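The lookup-then-compile pattern above can be tried in isolation from the GPU types. The sketch below is a hypothetical generic cache (`CompileCache` is not a Ryujinx class) that counts misses, which makes it easy to confirm that a repeated key does not trigger a second compilation. It is not thread-safe; concurrent callers would need a lock or a `ConcurrentDictionary`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a shader cache, not Ryujinx code.
public sealed class CompileCache<TKey, TValue> where TKey : notnull
{
    private readonly Dictionary<TKey, TValue> _entries = new();
    private readonly Func<TKey, TValue> _compile;

    // Number of times the expensive compile function actually ran.
    public int Misses { get; private set; }

    public CompileCache(Func<TKey, TValue> compile)
    {
        _compile = compile ?? throw new ArgumentNullException(nameof(compile));
    }

    public TValue Get(TKey key)
    {
        // Fast path: the value was compiled earlier.
        if (_entries.TryGetValue(key, out TValue cached))
        {
            return cached;
        }

        // Slow path: compile once and remember the result.
        Misses++;
        TValue value = _compile(key);
        _entries[key] = value;
        return value;
    }
}
```

Comparing `Misses` with the total number of `Get` calls gives a simple hit-rate figure, which is often the first thing to check when a cache does not deliver the expected speedup.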
Texture Caching
Reuse texture resources
Compress textures when possible
Use appropriate texture formats
Implement mipmap generation efficiently
Memory Profiling
Take memory snapshot
Visual Studio: Debug → Memory Usage → Take Snapshot
Perform operation
Execute the code you want to analyze
Take second snapshot
Compare snapshots to see allocations
Analyze differences
Identify objects that weren’t garbage collected
Common Memory Issues
Symptom: Memory usage grows over time
Causes:
Event handlers not unsubscribed
Static collections holding references
Dispose() not called on IDisposable objects
Fix:
// Unsubscribe events
obj.Event -= Handler;
// Use weak references
var weakRef = new WeakReference<T>(obj);
// Dispose properly
using var resource = new Resource();
Symptom: High GC pressure, frequent Gen0 collections
Fix:
Use object pooling
Use Span<T> and stackalloc
Reuse buffers
Large Object Heap Fragmentation
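Symptom: Memory usage higher than expected
Fix:
Avoid allocating objects of 85,000 bytes or more, which are placed on the Large Object Heap
Use array pooling
Use GC.TryStartNoGCRegion(totalSize) to keep collections out of latency-critical sections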
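To check whether a particular allocation ends up on the Large Object Heap, one option is to ask the GC which generation it reports for the object. The sketch below assumes the default LOH threshold of 85,000 bytes (it can be raised through runtime configuration); `LohCheck` is a hypothetical helper name.

```csharp
using System;

// Hypothetical diagnostic helper, not part of Ryujinx.
public static class LohCheck
{
    // Large Object Heap allocations are reported as the highest generation
    // (generation 2) immediately, whereas a freshly allocated small object
    // starts in generation 0.
    public static bool IsOnLargeObjectHeap(byte[] array)
    {
        return GC.GetGeneration(array) == GC.MaxGeneration;
    }
}
```

Calling it on a freshly allocated `new byte[100_000]` is expected to return true under the default threshold, while a fresh `new byte[1_000]` is expected to return false. The check is only meaningful right after allocation, because small objects that survive enough collections are also promoted to generation 2.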
Concurrency and Threading
Parallel Processing
using System.Linq;
using System.Threading.Tasks;
// Parallel loops
Parallel.For(0, count, i =>
{
    ProcessItem(i);
});
// Parallel LINQ
var results = items.AsParallel()
    .Where(x => x.IsValid)
    .Select(x => Transform(x))
    .ToList();
// Task-based parallelism
var tasks = new Task[10];
for (int i = 0; i < 10; i++)
{
    int index = i; // Copy the loop variable so each task captures its own value
    tasks[i] = Task.Run(() => ProcessItem(index));
}
await Task.WhenAll(tasks);
Lock-Free Programming
using System.Collections.Concurrent;
using System.Threading;
// Interlocked operations
Interlocked.Increment(ref counter);
Interlocked.CompareExchange(ref value, newValue, comparand);
// Concurrent collections
var queue = new ConcurrentQueue<T>();
var dict = new ConcurrentDictionary<K, V>();
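Parallel loops and Interlocked operations are often combined when many threads contribute to one total. The sketch below uses the thread-local overload of `Parallel.For` so that each worker accumulates privately and touches the shared variable only once at the end, which keeps contention low. `ParallelSum` and `SumTo` are hypothetical names for illustration.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical example, not part of Ryujinx.
public static class ParallelSum
{
    // Sums the integers 0 .. count - 1 across multiple threads.
    public static long SumTo(int count)
    {
        long total = 0;

        Parallel.For(
            0,
            count,
            // Each worker starts with its own private subtotal.
            () => 0L,
            // The loop body updates only the private subtotal (no locking).
            (i, loopState, subtotal) => subtotal + i,
            // Each worker publishes its subtotal once, atomically.
            subtotal => Interlocked.Add(ref total, subtotal));

        return total;
    }
}
```

Calling `Interlocked.Add` on every iteration would also be correct, but all threads would then contend on the same memory location for each element.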
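using System.Diagnostics;
using System.Threading;
// CPU usage (PerformanceCounter is Windows-only)
var cpuCounter = new PerformanceCounter(
    "Processor", "% Processor Time", "_Total");
cpuCounter.NextValue(); // The first read of a rate counter returns 0
Thread.Sleep(1000);     // Wait so the next read covers a real interval
float cpuUsage = cpuCounter.NextValue();
// Managed heap size (not total process memory)
long memoryUsage = GC.GetTotalMemory(false);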
Custom Metrics
public class PerformanceMetrics
{
    private long _frameCount;
    private readonly Stopwatch _fpsTimer = Stopwatch.StartNew();

    public void RecordFrame()
    {
        Interlocked.Increment(ref _frameCount);
        if (_fpsTimer.ElapsedMilliseconds >= 1000)
        {
            // Read and reset in one atomic step so no frames are lost
            long fps = Interlocked.Exchange(ref _frameCount, 0);
            Logger.Info?.Print(LogClass.Application, $"FPS: {fps}");
            _fpsTimer.Restart();
        }
    }
}
Load Testing
[Test]
public void LoadTest()
{
    const int operations = 1000000;
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < operations; i++)
    {
        DoOperation();
    }
    sw.Stop();
    double opsPerSecond = operations / sw.Elapsed.TotalSeconds;
    TestContext.WriteLine($"Operations/sec: {opsPerSecond:N0}");
    Assert.That(opsPerSecond, Is.GreaterThan(100000));
}
Optimization Checklist
Before optimizing, verify:
Common Bottlenecks in Emulation
CPU Emulation
JIT compilation overhead
Instruction decoding
Register state management
Memory access translation
GPU Emulation
Shader compilation/translation
Texture uploads/downloads
Draw call overhead
GPU synchronization
Memory Management
Page table lookups
Memory mapping/unmapping
Cache invalidation
GC pressure from allocations
I/O Operations
File system access
Save state serialization
Shader cache persistence
Log file writes
Profile on target hardware - Performance characteristics vary significantly between systems
Optimize algorithms first - A better algorithm beats micro-optimizations
Cache expensive operations - Especially JIT compilation and shader translation
Use async/await correctly - Don’t block threads unnecessarily
Monitor GC metrics - Excessive GC pauses hurt emulation smoothness
Next Steps
Testing: Benchmark your optimizations
Debugging: Profile and debug issues
Contributing: Submit performance improvements