Performance Optimization

Overview

Optimizing FFmpeg code requires understanding which functions matter most, how to write efficient implementations, and how to leverage architecture-specific features like SIMD instructions.

This guide is based on FFmpeg’s official optimization documentation and best practices accumulated over years of development.

What to Optimize

Identify Hot Paths First

Before optimizing, identify the functions that consume the most CPU time:

Profile Your Code

Use profiling tools to identify bottlenecks (see Profiling section)

Check Existing Optimizations

Look in the x86/ directory - many important functions are already optimized

Focus on High-Impact Functions

Prioritize optimizing functions called frequently in common codecs

Architecture-Specific Considerations

x86/x64

Most critical functions already have x86 optimizations. Focus on:

Fine-tuning existing SIMD code
Adding AVX2/AVX-512 versions
Optimizing newer codecs

Function Importance Guide

Critical Functions (Highest Impact)

These functions are used extensively in motion compensation and encoding:

Motion Compensation Functions

put_pixels, put_no_rnd_pixels variants:

put_pixels{,_x2,_y2,_xy2}

Usage: Motion compensation in encoding/decoding
Priority: Critical
Impact: High - used in every motion-compensated frame

avg_pixels variants:

avg_pixels{,_x2,_y2,_xy2}

Usage: Motion compensation of B-frames
Priority: High
Impact: Medium - only B-frames, but common

Motion Estimation Functions

pix_abs16x16 variants:

pix_abs16x16{,_x2,_y2,_xy2}

Usage: Motion estimation with SAD (Sum of Absolute Differences)
Priority: Critical
Impact: Very high - directly affects encoding speed

pix_abs8x8 variants:

pix_abs8x8{,_x2,_y2,_xy2}

Usage: MPEG-4 4MV motion estimation
Priority: Medium
Impact: Lower than 16x16 variants
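Before touching SIMD, it helps to have a plain C version of the metric to compare against. The sketch below is a minimal reference for a 16x16 SAD; the name `sad16x16_ref` and its signature are assumptions for illustration, not FFmpeg's actual API.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference helper (not an FFmpeg function): sum of absolute
 * differences between two 16x16 blocks that share the same stride. */
static int sad16x16_ref(const uint8_t *a, const uint8_t *b, ptrdiff_t stride)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(a[x] - b[x]);   /* operands promote to int, so no overflow */
        a += stride;                   /* advance both blocks by one row */
        b += stride;
    }
    return sum;
}
```

An optimized version can then be checked against this function on the same inputs before any timing is done.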

MPEG-4 Specific Functions

Quarter-pixel motion compensation:

mpeg4_qpel* / *qpel_mc*

Usage: MPEG-4 qpel encoding & decoding
Priority: High for MPEG-4
Impact: Significant for qpel-enabled content
Note: qpel8 used only for 4mv, avg_* only for B-frames

GMC (Global Motion Compensation):

gmc / gmc1

Usage: MPEG-4 GMC decoding
Priority: Medium
Impact: Significant when GMC is used
Note: gmc1 for single warp point (common in DivX5)

Encoding Functions

Pixel processing:

get_pixels / diff_pixels

Usage: Encoding
Priority: High
Complexity: Easy to optimize

Block clearing:

clear_blocks

Usage: Encoding
Priority: High
Complexity: Easiest to optimize

Pixel sum:

pix_sum

Usage: Encoding
Priority: Medium
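The encoding helpers above are simple enough to state in a few lines of C, which is part of why they are easy targets. The following is a sketch only: the `_ref` names, the 8x8 and 16x16 block sizes, and the signatures are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: store the 8x8 difference s1 - s2 as 16-bit values. */
static void diff_pixels_ref(int16_t *block, const uint8_t *s1,
                            const uint8_t *s2, ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            block[y * 8 + x] = (int16_t)(s1[x] - s2[x]);
        s1 += stride;
        s2 += stride;
    }
}

/* Hypothetical reference: sum of all pixels in a 16x16 block. */
static int pix_sum_ref(const uint8_t *pix, ptrdiff_t stride)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += pix[x];
        pix += stride;
    }
    return sum;
}
```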

Transform Functions

IDCT/FDCT:

idct / fdct

Usage: idct (encoding & decoding), fdct (encoding only)
Priority: Critical
Complexity: Difficult to optimize
Note: Some optimized IDCTs include clamping, making separate clamping functions unused

Clamping:

put_pixels_clamped / add_pixels_clamped

Usage: IDCT output processing
Priority: High
Complexity: Easy
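To make the clamping step concrete, here is a minimal C sketch of what a clamped store does to an 8x8 block of IDCT output. The helper names `clip_uint8` and `put_pixels_clamped_ref` are hypothetical; only the behavior (saturate each coefficient to 0..255) is taken from the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: saturate an integer to the 0..255 pixel range. */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical reference: write an 8x8 block of 16-bit values as pixels. */
static void put_pixels_clamped_ref(const int16_t *block, uint8_t *pixels,
                                   ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            pixels[x] = clip_uint8(block[y * 8 + x]);
        pixels += stride;   /* next output row */
    }
}
```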

Quantization:

dct_quantize / dct_quantize_trellis

Usage: Encoding
Priority: High / Medium
Complexity: Difficult
Note: Trellis quantization is slower, less commonly used

Dequantization:

dct_unquantize_mpeg1
dct_unquantize_mpeg2
dct_unquantize_h263

Usage: Codec-specific decoding/encoding
Priority: High

Low Priority Functions

Don’t waste time optimizing these unless you have a specific use case:

avg_no_rnd_pixels*          // Unused
put_mspel8_mc* / wmv2_mspel8*  // WMV2 only (uncommon codec)
qpel{8,16}_mc??_old_c       // Backward compatibility
add_bytes / diff_bytes      // HuffYUV only
dct_sad / quant_psnr        // Rarely used quality metrics

Optimization Justification

When to Optimize

Always Justified

0.1%+ speedup for common codecs
No regression in code size/readability
At least one factor (speed, code size, or readability) improves and none gets worse

Sometimes Justified

Smaller gains for less common codecs
Trade-offs between speed and maintainability

Rarely Justified

Obscure codec with minimal usage
Significant complexity increase
Negligible performance gain

Goal for Obscure Codecs

Keep code clean, small, and readable over raw performance
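Whether a change clears a threshold such as 0.1% is plain arithmetic on two timings. The helper below is a hypothetical sketch, not part of FFmpeg, and it assumes both timings were measured on the same input under the same conditions.

```c
/* Hypothetical helper: percentage of run time saved by the optimized build.
 * A positive result is a speedup, a negative result is a regression. */
static double speedup_percent(double baseline_s, double optimized_s)
{
    return (baseline_s - optimized_s) / baseline_s * 100.0;
}
```

For example, going from 100.0 s to 99.9 s is roughly a 0.1% improvement, which is small enough that run-to-run noise has to be ruled out before trusting it.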

Performance Measurement

# Report CPU time and peak memory for the whole run
ffmpeg -benchmark -i input.mp4 -f null -

# Report timing for each processing step (decode, encode, and so on)
ffmpeg -benchmark_all -i input.mp4 -f null -

Assembly Optimization

Inline vs External Assembly

Inline Assembly

When to use:

The code must be inlined into a C function
Small, frequently called functions
Need access to C struct members

Advantages:

Compiler handles register allocation
Direct access to C variables
Better inlining opportunities

Example:

// SAD of one 16-byte row (requires SSE2)
static inline int sad_inline(const uint8_t *a, const uint8_t *b) {
    int result;
    __asm__ (
        "movdqu    (%1), %%xmm0\n"
        "movdqu    (%2), %%xmm1\n"
        "psadbw    %%xmm1, %%xmm0\n"  // One partial sum per 64-bit half
        "movhlps   %%xmm0, %%xmm1\n"  // Bring the upper partial sum down
        "paddw     %%xmm1, %%xmm0\n"  // Total is now in the low word
        "movd      %%xmm0, %0\n"
        : "=r"(result)
        : "r"(a), "r"(b)
        : XMM_CLOBBERS("xmm0", "xmm1",) "memory"
    );
    return result;
}

External Assembly

When to use:

Calls external functions
Large assembly functions
Shared across multiple codecs

Advantages:

Cleaner separation
Easier to maintain
Can use advanced assembler features

Example:

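; x86/sad.asm (illustrative; relies on the x86inc.asm macros)
; Assumed prototype: int pixel_sad_16x16(const uint8_t *pix1, const uint8_t *pix2,
;                                        ptrdiff_t stride1, ptrdiff_t stride2)
INIT_XMM sse2
cglobal pixel_sad_16x16, 4, 5, 3   ; 4 arguments, 5 GPRs, 3 XMM registers
    pxor       m0, m0
    mov        r4d, 16             ; row counter
.loop:
    movdqu     m1, [r0]
    movdqu     m2, [r1]
    psadbw     m1, m2              ; one partial sum per 64-bit half
    paddd      m0, m1
    add        r0, r2
    add        r1, r3
    dec        r4d
    jg         .loop
    movhlps    m1, m0              ; fold the upper half onto the lower
    paddd      m0, m1
    movd       eax, m0
    RET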

General Assembly Tips

Critical Rules:

Use assembly loops, not C loops:

// Good
__asm__ volatile(
    "1:                     \n"
    "  movdqa  (%0), %%xmm0 \n"
    "  /* process */        \n"
    "  add     $16, %0      \n"
    "  dec     %1           \n"
    "  jg      1b           \n"
    : "+r"(ptr), "+r"(count)
    :
    : XMM_CLOBBERS("xmm0",) "memory"
);

// Bad
do {
    __asm__("movdqa (%0), %%xmm0" : : "r"(ptr));
    ptr += 16;
} while(--count);

Mark all clobbered registers:

// x86 inline assembly
__asm__("..." 
    : /* outputs */
    : /* inputs */
    : XMM_CLOBBERS("xmm0", "xmm1",) "eax", "memory"
);

Don’t rely on registers between asm blocks:

// Bad - xmm7 may be clobbered
__asm__("movdqa %0, %%xmm7" :: "m"(src));
/* other code */
__asm__("movdqa %%xmm7, %0" : "=m"(dst));

// Good - single asm block
__asm__(
    "movdqa   %1, %%xmm7   \n"
    "/* processing */       \n"
    "movdqa   %%xmm7, %0   \n"
    : "=m"(dst)
    : "m"(src)
    XMM_CLOBBERS_ONLY("xmm7")  // Supplies its own ':' when xmm clobbers are supported
);

Prefer external or inline asm over intrinsics:

// Avoid - compiler-dependent
__m128i a = _mm_load_si128((__m128i*)ptr);
__m128i b = _mm_sad_epu8(a, zero);

// Prefer - explicit control
__asm__("psadbw %1, %0" : "+x"(a) : "x"(zero));

Alignment Requirements

Many SIMD instructions require aligned data:

void (*put_pixels_clamped)(
    const int16_t *block /*align 16*/,
    uint8_t *pixels /*align 8*/,
    ptrdiff_t stride
);

Ensure alignment:

// Declare aligned buffers
DECLARE_ALIGNED(16, uint8_t, buffer)[256];

// Check alignment at runtime
if ((uintptr_t)ptr & 15) {
    // Use unaligned version
} else {
    // Use aligned version
}
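The runtime check above can be wrapped in a small helper. This is a minimal sketch in standard C11: `is_aligned` and `aligned_buf` are hypothetical names, and `alignas` stands in for `DECLARE_ALIGNED` so the example compiles without FFmpeg headers.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical helper: nonzero when p is a multiple of alignment,
 * which must be a power of two. */
static int is_aligned(const void *p, uintptr_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* C11 alignas requests a 16-byte aligned static buffer. */
static alignas(16) uint8_t aligned_buf[256];
```

A dispatcher would call `is_aligned(ptr, 16)` once per buffer and pick the aligned or unaligned routine, rather than testing inside the inner loop.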

SIMD Optimization Strategies

Vectorization Patterns

Horizontal Reduction

Summing elements within a vector:

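// Sum eight 16-bit values in xmm0; the total ends up in the low 16 bits.
// Fragment only: real code keeps this inside the asm block that fills xmm0.
__asm__(
    "movhlps   %%xmm0, %%xmm1    \n"  // xmm1 low half = upper half of xmm0
    "paddw     %%xmm1, %%xmm0    \n"  // 8 values -> 4 partial sums
    "pshufd    $1, %%xmm0, %%xmm1\n"  // Bring words 2-3 down to words 0-1
    "paddw     %%xmm1, %%xmm0    \n"  // 4 -> 2 partial sums
    "pshuflw   $1, %%xmm0, %%xmm1\n"  // Bring word 1 down to word 0
    "paddw     %%xmm1, %%xmm0    \n"  // 2 -> 1: total in word 0
    ::: "xmm0", "xmm1"
);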
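A full reduction of eight 16-bit lanes takes three halving steps. The scalar model below mirrors that tree in plain C so the data movement is easy to follow; `hsum8_u16` is a hypothetical name, and the 16-bit wraparound is deliberate because it matches what `paddw` does.

```c
#include <stdint.h>

/* Hypothetical scalar model of a SIMD horizontal add over eight 16-bit lanes.
 * Each pass adds the upper half of the active lanes onto the lower half. */
static uint16_t hsum8_u16(const uint16_t v[8])
{
    uint16_t t[8];
    for (int i = 0; i < 8; i++)
        t[i] = v[i];
    for (int width = 4; width >= 1; width /= 2)      /* 8 -> 4 -> 2 -> 1 */
        for (int i = 0; i < width; i++)
            t[i] = (uint16_t)(t[i] + t[i + width]);  /* wraps like paddw */
    return t[0];
}
```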

Loop Unrolling

Process multiple elements per iteration:

// Process 4 blocks per iteration
for (int i = 0; i < h; i += 4) {
    // Process block i
    // Process block i+1
    // Process block i+2
    // Process block i+3
}
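An unrolled loop also needs a tail that handles counts that are not a multiple of the unroll factor. The sketch below shows one common shape, with four independent accumulators and a scalar tail; `sum_bytes_unrolled` is a hypothetical example, not an FFmpeg function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum n bytes, four per iteration. */
static unsigned sum_bytes_unrolled(const uint8_t *p, size_t n)
{
    unsigned s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main loop: separate accumulators avoid one long dependency chain. */
    for (; i + 4 <= n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    /* Tail: the 0 to 3 elements left over. */
    for (; i < n; i++)
        s0 += p[i];

    return s0 + s1 + s2 + s3;
}
```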

Data Rearrangement

Interleave or deinterleave data for efficient processing:

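// Regroup packed RGB bytes with a shuffle mask (pshufb requires SSSE3).
// mask is a 16-byte aligned 128-bit constant of source byte indices.
__asm__(
    "movdqu      (%0), %%xmm0     \n"  // Load 16 bytes: R0 G0 B0 R1 G1 B1 ...
    "pshufb      %2, %%xmm0       \n"  // Group the R, G, and B bytes together
    "movdqu      %%xmm0, (%1)     \n"  // Store the regrouped bytes
    :: "r"(src), "r"(dst), "m"(mask)
    : XMM_CLOBBERS("xmm0",) "memory"
);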

Predication

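Replace data-dependent branches with branch-free operations such as min/max or compare-generated masks:

// Clamp each 16-bit lane of val to [minval, maxval] without branching
__asm__(
    "pminsw    %1, %0    \n"  // val = min(val, maxval)
    "pmaxsw    %2, %0    \n"  // val = max(val, minval)
    : "+x"(val)
    : "x"(maxval), "x"(minval)
);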
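The mask idea can be shown on a single lane in plain C. In this sketch the comparison produces an all-ones or all-zeros mask, which then selects between two values without a branch; `select_max_masked` is a hypothetical name, and the per-lane behavior is meant to resemble a compare followed by and/andnot/or in SIMD code.

```c
#include <stdint.h>

/* Hypothetical single-lane model of mask-based selection: returns the larger
 * of a and b without branching. */
static int16_t select_max_masked(int16_t a, int16_t b)
{
    int16_t mask = (int16_t)-(a > b);            /* all ones if a > b, else 0 */
    return (int16_t)((a & mask) | (b & ~mask));  /* keep a or b accordingly */
}
```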

Profiling

Tools

Linux - perf

# Record profile
perf record -g ffmpeg -i input.mp4 output.mp4

# View results
perf report

# Annotate assembly
perf annotate

macOS - Instruments

# Profile with Instruments
instruments -t "Time Profiler" ffmpeg -i input.mp4 output.mp4

Windows - VTune

Use Intel VTune Profiler GUI or command line:

vtune -collect hotspots -- ffmpeg.exe -i input.mp4 output.mp4

Cross-platform - Valgrind

# Profile with callgrind
valgrind --tool=callgrind ffmpeg -i input.mp4 output.mp4

# Visualize with kcachegrind
kcachegrind callgrind.out.*

FFmpeg Built-in Benchmarking

# Overall benchmark
ffmpeg -benchmark -i input.mp4 -f null -

# Per-step timing (decode, encode, and so on)
ffmpeg -benchmark_all -i input.mp4 -f null -

Testing Optimizations

Correctness Testing

Use fate tests

make fate-rsync SAMPLES=fate-suite/
make fate SAMPLES=fate-suite/

Compare output

# Reference: C code paths only
ffmpeg -cpuflags 0 -i input.mp4 reference.yuv
# Optimized: auto-detected SIMD code paths (the default)
ffmpeg -i input.mp4 optimized.yuv
cmp reference.yuv optimized.yuv

Checksum validation

ffmpeg -i input.mp4 -f md5 -
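The same compare-against-reference idea works at the function level. The harness below is a sketch under stated assumptions: `sad_ref`, `sad_unrolled`, and `compare_impls` are hypothetical names, and the unrolled C function merely stands in for whatever optimized routine is being tested.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef int (*sad_fn)(const uint8_t *a, const uint8_t *b, size_t n);

/* Straightforward reference implementation. */
static int sad_ref(const uint8_t *a, const uint8_t *b, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += abs(a[i] - b[i]);
    return sum;
}

/* Stand-in for an optimized version: unrolled by two with a scalar tail. */
static int sad_unrolled(const uint8_t *a, const uint8_t *b, size_t n)
{
    int s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        s0 += abs(a[i] - b[i]);
        s1 += abs(a[i + 1] - b[i + 1]);
    }
    if (i < n)
        s0 += abs(a[i] - b[i]);
    return s0 + s1;
}

/* Run both implementations on pseudo-random inputs of varying length and
 * return the number of trials where the results differ (0 means agreement). */
static int compare_impls(sad_fn ref, sad_fn opt, unsigned seed, int trials)
{
    uint8_t a[64], b[64];
    int mismatches = 0;
    srand(seed);                               /* fixed seed: reproducible runs */
    for (int t = 0; t < trials; t++) {
        size_t n = (size_t)(rand() % 65);      /* lengths 0..64, odd sizes included */
        for (size_t i = 0; i < 64; i++) {
            a[i] = (uint8_t)rand();
            b[i] = (uint8_t)rand();
        }
        if (ref(a, b, n) != opt(a, b, n))
            mismatches++;
    }
    return mismatches;
}
```

Varying the length matters because tail handling is where unrolled and vectorized code most often goes wrong.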

Performance Testing

# Baseline: disable all CPU-specific optimizations
ffmpeg -benchmark -cpuflags 0 -i input.mp4 -f null - 2>&1 | grep bench

# Optimized: let FFmpeg auto-detect CPU features (the default)
ffmpeg -benchmark -i input.mp4 -f null - 2>&1 | grep bench
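For a single function, a small timing loop can complement the whole-program numbers. The sketch below uses the standard `clock()` call and keeps the best of several runs to reduce noise; `time_kernel`, `sum_bytes`, and the repetition count of 1000 are all hypothetical choices, and `clock()` resolution varies by platform.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef unsigned (*kernel_fn)(const uint8_t *p, size_t n);

/* Example kernel to time: sum of n bytes. */
static unsigned sum_bytes(const uint8_t *p, size_t n)
{
    unsigned s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Time 1000 calls of fn, repeat that `runs` times, and return the best time
 * in seconds. Results are accumulated into *sink so the calls stay observable. */
static double time_kernel(kernel_fn fn, const uint8_t *p, size_t n,
                          int runs, unsigned *sink)
{
    double best = -1.0;
    for (int r = 0; r < runs; r++) {
        unsigned acc = 0;
        clock_t t0 = clock();
        for (int rep = 0; rep < 1000; rep++)
            acc += fn(p, n);
        clock_t t1 = clock();
        double s = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || s < best)
            best = s;
        *sink += acc;
    }
    return best;
}
```

Taking the minimum rather than the mean is a design choice: interference from other processes only ever makes a run slower, so the fastest run is usually the closest to the true cost.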

Best Practices

Profile First

Always profile before optimizing to find real bottlenecks

Test Correctness

Verify optimized code produces identical output

Measure Impact

Quantify performance improvement objectively

Document Assumptions

Note alignment requirements and constraints

Handle Edge Cases

Ensure correct behavior for unusual inputs

Maintain Readability

Balance optimization with code maintainability

Additional Resources

Architecture Guide

Understanding FFmpeg’s structure

Multithreading

Parallelization strategies

x86 Optimization

Agner Fog’s optimization guides

Intel Intrinsics

SIMD intrinsics reference

Contributing

Developer Resources
