
Overview

Optimizing FFmpeg code requires understanding which functions matter most, how to write efficient implementations, and how to leverage architecture-specific features like SIMD instructions.
This guide is based on FFmpeg’s official optimization documentation and best practices accumulated over years of development.

What to Optimize

Identify Hot Paths First

Before optimizing, identify the functions that consume the most CPU time:
1. Profile Your Code

Use profiling tools to identify bottlenecks (see the Profiling section).

2. Check Existing Optimizations

Look in the x86/ directory; many important functions are already optimized.

3. Focus on High-Impact Functions

Prioritize functions that are called frequently in common codecs.

Architecture-Specific Considerations

Most critical functions already have x86 optimizations. Focus on:
  • Fine-tuning existing SIMD code
  • Adding AVX2/AVX-512 versions
  • Optimizing newer codecs

Function Importance Guide

Critical Functions (Highest Impact)

These functions are used extensively in motion compensation and encoding:

Motion Compensation Functions

put_pixels, put_no_rnd_pixels variants:
put_pixels{,_x2,_y2,_xy2}
  • Usage: Motion compensation in encoding/decoding
  • Priority: Critical
  • Impact: High - used in every motion-compensated frame
avg_pixels variants:
avg_pixels{,_x2,_y2,_xy2}
  • Usage: Motion compensation of B-frames
  • Priority: High
  • Impact: Medium - only B-frames, but common
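
The `_x2` suffix denotes horizontal half-pel interpolation, and the `no_rnd` variants differ only in the rounding bias. A minimal scalar sketch for 8-pixel-wide blocks follows. The names `put_pixels8_x2_ref` and `put_no_rnd_pixels8_x2_ref` are hypothetical and used here for illustration only, not FFmpeg's actual symbols:

```c
#include <stddef.h>
#include <stdint.h>

/* Horizontal half-pel interpolation: each output pixel is the rounded
 * average of a source pixel and its right-hand neighbour.
 * Reads 9 source pixels per row. */
static void put_pixels8_x2_ref(uint8_t *dst, const uint8_t *src,
                               ptrdiff_t stride, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 8; x++)
            dst[x] = (uint8_t)((src[x] + src[x + 1] + 1) >> 1);
        dst += stride;
        src += stride;
    }
}

/* Same interpolation without the +1 bias, so halves round down. */
static void put_no_rnd_pixels8_x2_ref(uint8_t *dst, const uint8_t *src,
                                      ptrdiff_t stride, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 8; x++)
            dst[x] = (uint8_t)((src[x] + src[x + 1]) >> 1);
        dst += stride;
        src += stride;
    }
}
```

A scalar version like this is useful mainly as a reference when checking a SIMD implementation.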

Motion Estimation Functions

pix_abs16x16 variants:
pix_abs16x16{,_x2,_y2,_xy2}
  • Usage: Motion estimation with SAD (Sum of Absolute Differences)
  • Priority: Critical
  • Impact: Very high - directly affects encoding speed
pix_abs8x8 variants:
pix_abs8x8{,_x2,_y2,_xy2}
  • Usage: MPEG-4 4MV motion estimation
  • Priority: Medium
  • Impact: Lower than 16x16 variants
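
For orientation, the quantity these functions compute can be written as a plain C reference. This is a sketch under the assumption of a 16x16 block with a common stride; `sad16x16_ref` is a hypothetical name:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between two 16x16 blocks that share
 * the same line stride. */
static int sad16x16_ref(const uint8_t *a, const uint8_t *b, ptrdiff_t stride)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(a[x] - b[x]);
        a += stride;
        b += stride;
    }
    return sum;
}
```

Motion estimation evaluates this for many candidate positions per macroblock, which is why the inner loop is such a frequent SIMD target.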

MPEG-4 Specific Functions

Quarter-pixel motion compensation:
mpeg4_qpel* / *qpel_mc*
  • Usage: MPEG-4 qpel encoding & decoding
  • Priority: High for MPEG-4
  • Impact: Significant for qpel-enabled content
  • Note: qpel8 used only for 4mv, avg_* only for B-frames
GMC (Global Motion Compensation):
gmc / gmc1
  • Usage: MPEG-4 GMC decoding
  • Priority: Medium
  • Impact: Significant when GMC is used
  • Note: gmc1 for single warp point (common in DivX5)

Encoding Functions

Pixel processing:
get_pixels / diff_pixels
  • Usage: Encoding
  • Priority: High
  • Complexity: Easy to optimize
Block clearing:
clear_blocks
  • Usage: Encoding
  • Priority: High
  • Complexity: Easiest to optimize
Pixel sum:
pix_sum
  • Usage: Encoding
  • Priority: Medium

Transform Functions

IDCT/FDCT:
idct / fdct
  • Usage: idct (encoding & decoding), fdct (encoding only)
  • Priority: Critical
  • Complexity: Difficult to optimize
  • Note: Some optimized IDCTs include clamping, making separate clamping functions unused
Clamping:
put_pixels_clamped / add_pixels_clamped
  • Usage: IDCT output processing
  • Priority: High
  • Complexity: Easy
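
A scalar sketch of what the clamping step does, assuming an 8x8 block of 16-bit IDCT output. The names `clip_uint8_ref` and `put_pixels_clamped_ref` are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Saturate a signed value to the 0..255 range of an 8-bit pixel. */
static uint8_t clip_uint8_ref(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Store an 8x8 block of 16-bit coefficients as clamped 8-bit pixels. */
static void put_pixels_clamped_ref(const int16_t *block, uint8_t *pixels,
                                   ptrdiff_t stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            pixels[x] = clip_uint8_ref(block[y * 8 + x]);
        pixels += stride;
    }
}
```

In SIMD code the same saturation is typically a single pack-with-unsigned-saturation instruction, which is why this function is considered easy to optimize.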
Quantization:
dct_quantize / dct_quantize_trellis
  • Usage: Encoding
  • Priority: High (dct_quantize) / Medium (dct_quantize_trellis)
  • Complexity: Difficult
  • Note: Trellis quantization is slower, less commonly used
Dequantization:
dct_unquantize_mpeg1
dct_unquantize_mpeg2
dct_unquantize_h263
  • Usage: Codec-specific decoding/encoding
  • Priority: High

Low Priority Functions

Don’t waste time optimizing these unless you have a specific use case:
avg_no_rnd_pixels*          // Unused
put_mspel8_mc* / wmv2_mspel8*  // WMV2 only (uncommon codec)
qpel{8,16}_mc??_old_c       // Backward compatibility
add_bytes / diff_bytes      // HuffYUV only
dct_sad / quant_psnr        // Rarely used quality metrics

Optimization Justification

When to Optimize

Always Justified

  • 0.1%+ speedup for common codecs
  • No regression in code size/readability
  • At least one of speed, code size, or readability improves and none of the others gets worse

Sometimes Justified

  • Smaller gains for less common codecs
  • Trade-offs between speed and maintainability

Rarely Justified

  • Obscure codec with minimal usage
  • Significant complexity increase
  • Negligible performance gain

Goal for Obscure Codecs

Keep code clean, small, and readable over raw performance

Performance Measurement

# Report total CPU time and peak memory use for the whole run
ffmpeg -benchmark -i input.mp4 -f null -

# Report timing for the individual processing steps (decode, encode, ...)
ffmpeg -benchmark_all -i input.mp4 -f null -
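
The flags above time whole runs. To judge a single function in isolation, a small harness can help. The sketch below uses only the standard C `clock()` function; `bench_seconds_per_call` and the `invert_bytes` workload are hypothetical names invented for this example. FFmpeg's own tree also ships `tests/checkasm/checkasm --bench` for this purpose.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef void (*bench_fn)(uint8_t *buf, size_t len);

/* Example workload (hypothetical): invert every byte in place. */
static void invert_bytes(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)~buf[i];
}

/* Average CPU seconds per call over `runs` back-to-back calls. */
static double bench_seconds_per_call(bench_fn fn, uint8_t *buf,
                                     size_t len, int runs)
{
    clock_t start = clock();
    for (int i = 0; i < runs; i++)
        fn(buf, len);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / runs;
}
```

Because `clock()` is coarse, use enough runs that the total time is well above the timer resolution, and compare the candidate against the baseline on the same input.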

Assembly Optimization

Inline vs External Assembly

When to use inline assembly:
  • Code should be inlined in C function
  • Small, frequently called functions
  • Need access to C struct members
Advantages:
  • Compiler handles register allocation
  • Direct access to C variables
  • Better inlining opportunities
Example:
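// SAD over 16 bytes of a and b
static inline int sad_inline(const uint8_t *a, const uint8_t *b) {
    int result;
    __asm__ (
        "movdqu    (%1), %%xmm0     \n"  // load 16 bytes of a
        "movdqu    (%2), %%xmm1     \n"  // load 16 bytes of b
        "psadbw    %%xmm1, %%xmm0   \n"  // one partial SAD per 64-bit half
        "movhlps   %%xmm0, %%xmm1   \n"  // move the upper partial SAD down
        "paddw     %%xmm1, %%xmm0   \n"  // add the two partial SADs
        "movd      %%xmm0, %0       \n"  // low 32 bits hold the total
        : "=r"(result)
        : "r"(a), "r"(b)
        : XMM_CLOBBERS("xmm0", "xmm1",) "memory"
    );
    return result;
}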
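
An inline-assembly SAD like this can be cross-checked against plain C. The following self-contained sketch assumes GCC or Clang on x86-64 and falls back to the C path elsewhere. It spells out the clobbers directly rather than using FFmpeg's `XMM_CLOBBERS` macro, and the names `sad16_ref` and `sad16_sse2` are hypothetical:

```c
#include <stdint.h>
#include <stdlib.h>

/* Plain C reference: SAD over 16 bytes. */
static int sad16_ref(const uint8_t *a, const uint8_t *b)
{
    int sum = 0;
    for (int i = 0; i < 16; i++)
        sum += abs(a[i] - b[i]);
    return sum;
}

/* SSE2 inline-asm version; uses the C reference on other targets. */
static int sad16_sse2(const uint8_t *a, const uint8_t *b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    int result;
    __asm__ (
        "movdqu   (%1), %%xmm0    \n\t"  /* 16 bytes of a */
        "movdqu   (%2), %%xmm1    \n\t"  /* 16 bytes of b */
        "psadbw   %%xmm1, %%xmm0  \n\t"  /* two 64-bit partial sums */
        "movhlps  %%xmm0, %%xmm1  \n\t"  /* upper partial sum -> xmm1 low */
        "paddw    %%xmm1, %%xmm0  \n\t"  /* combine partial sums */
        "movd     %%xmm0, %0      \n\t"  /* extract the total */
        : "=r"(result)
        : "r"(a), "r"(b)
        : "xmm0", "xmm1", "memory"
    );
    return result;
#else
    return sad16_ref(a, b);
#endif
}
```

Keeping the C reference next to the assembly makes it easy to assert that both agree on test data.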

General Assembly Tips

Critical Rules:
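  1. Use assembly loops, not C loops:
// Good: the whole loop lives in one asm block
__asm__ volatile(
    "1:                     \n"
    "  movdqa  (%0), %%xmm0 \n"
    "  /* process */        \n"
    "  add     $16, %0      \n"
    "  dec     %1           \n"
    "  jg      1b           \n"
    : "+r"(ptr), "+r"(count)
    :
    : XMM_CLOBBERS("xmm0",) "memory"
);

// Bad: the compiler controls everything between the asm statements
do {
    __asm__ volatile("movdqa (%0), %%xmm0" : : "r"(ptr));
    ptr += 16;
} while (--count);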
  2. Mark all clobbered registers:
// x86 inline assembly
__asm__("..." 
    : /* outputs */
    : /* inputs */
    : XMM_CLOBBERS("xmm0", "xmm1",) "eax", "memory"
);
  3. Don’t rely on registers between asm blocks:
// Bad - xmm7 may be clobbered
__asm__("movdqa %0, %%xmm7" :: "m"(src));
/* other code */
__asm__("movdqa %%xmm7, %0" : "=m"(dst));

// Good - single asm block
// XMM_CLOBBERS_ONLY supplies its own leading colon
__asm__(
    "movdqa   %1, %%xmm7   \n"
    "/* processing */       \n"
    "movdqa   %%xmm7, %0   \n"
    : "=m"(dst)
    : "m"(src)
    XMM_CLOBBERS_ONLY("xmm7")
);
  4. Prefer external or inline asm over intrinsics:
// Avoid - generated code depends on the compiler
__m128i a = _mm_load_si128((const __m128i *)ptr);
__m128i b = _mm_sad_epu8(a, zero);

// Prefer - explicit control over the emitted instruction
__asm__("psadbw %1, %0" : "+x"(a) : "x"(zero));

Alignment Requirements

Many SIMD instructions require aligned data:
void (*put_pixels_clamped)(
    const int16_t *block /*align 16*/,
    uint8_t *pixels /*align 8*/,
    ptrdiff_t stride
);
Ensure alignment:
// Declare aligned buffers
DECLARE_ALIGNED(16, uint8_t, buffer)[256];

// Check alignment at runtime
if ((uintptr_t)ptr & 15) {
    // Use unaligned version
} else {
    // Use aligned version
}
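
The runtime check generalizes to any power-of-two alignment. A small sketch using standard C11 `alignas` as a stand-in for FFmpeg's `DECLARE_ALIGNED` macro; `is_aligned`, `align_up` and `scratch` are hypothetical names:

```c
#include <stdalign.h>
#include <stdint.h>

/* 16-byte aligned scratch buffer (C11 alignas instead of DECLARE_ALIGNED). */
static alignas(16) uint8_t scratch[64];

/* Nonzero if p is a multiple of `alignment`, which must be a power of two. */
static int is_aligned(const void *p, uintptr_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Advance p to the next address that is a multiple of `alignment`
 * (power of two); returns p unchanged if it is already aligned. */
static uint8_t *align_up(uint8_t *p, uintptr_t alignment)
{
    uintptr_t misalign = (uintptr_t)p & (alignment - 1);
    return p + ((alignment - misalign) & (alignment - 1));
}
```

A dispatcher can use `is_aligned` once per call to pick between `movdqa`-based and `movdqu`-based code paths.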

SIMD Optimization Strategies

Vectorization Patterns

Summing elements within a vector:
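// Sum the 8 words in xmm0; the total ends up in the low word of xmm0
__asm__(
    "movhlps   %%xmm0, %%xmm1       \n"  // xmm1 low half = xmm0 upper half
    "paddw     %%xmm1, %%xmm0       \n"  // 8 words -> 4 partial sums
    "pshufd    $1, %%xmm0, %%xmm1   \n"  // bring dword 1 down to dword 0
    "paddw     %%xmm1, %%xmm0       \n"  // 4 -> 2 partial sums
    "pshuflw   $1, %%xmm0, %%xmm1   \n"  // bring word 1 down to word 0
    "paddw     %%xmm1, %%xmm0       \n"  // 2 -> 1: total in the low word
    ::: "xmm0", "xmm1"
);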
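
The reduction works by repeatedly folding the upper half of the remaining lanes onto the lower half. A scalar model of that pattern, with the hypothetical name `hsum8_model` and unsigned 16-bit lanes so the wraparound matches `paddw`:

```c
#include <stdint.h>

/* Scalar model of a halving reduction over 8 lanes of 16 bits:
 * fold lanes [width, 2*width) onto [0, width) for width = 4, 2, 1. */
static uint16_t hsum8_model(const uint16_t v[8])
{
    uint16_t lane[8];
    for (int i = 0; i < 8; i++)
        lane[i] = v[i];

    for (int width = 4; width >= 1; width /= 2)
        for (int i = 0; i < width; i++)
            lane[i] = (uint16_t)(lane[i] + lane[i + width]); /* wraps mod 2^16 */

    return lane[0];
}
```

Three folding steps are needed for eight lanes, which corresponds to three shuffle-and-add pairs in the vector version.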
Process multiple elements per iteration:
// Process 4 blocks per iteration
for (int i = 0; i < h; i += 4) {
    // Process block i
    // Process block i+1
    // Process block i+2
    // Process block i+3
}
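
A concrete instance of processing several elements per iteration: the sketch below sums bytes with four independent accumulators and handles the leftover elements in a tail loop. The function name `sum_bytes_unrolled` is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Sum n bytes, four per iteration, using independent accumulators so
 * the additions do not form one long dependency chain. */
static uint32_t sum_bytes_unrolled(const uint8_t *p, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    for (; i < n; i++)  /* tail: 0 to 3 leftover bytes */
        s0 += p[i];

    return s0 + s1 + s2 + s3;
}
```

The tail loop matters: block heights and widths are not always a multiple of the unroll factor.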
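Interleave or deinterleave data for efficient processing:
// Deinterleave RGB to planar (outline only; the plane extraction is omitted)
__asm__(
    "movdqu      (%0), %%xmm0          \n"  // Load 16 bytes of packed RGB
    "pshufd      $0x39, %%xmm0, %%xmm1 \n"  // Rotate dwords right by one
    /* Extract R, G, B planes */
    :: "r"(src)
    : XMM_CLOBBERS("xmm0", "xmm1",) "memory"
);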
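Use min/max instructions or masks to process elements conditionally without branches:
// Clamp the words in xmm0 to [minval, maxval]
__asm__(
    "pminsw    %0, %%xmm0    \n"  // min(val, maxval)
    "pmaxsw    %1, %%xmm0    \n"  // max(val, minval)
    :: "x"(maxval), "x"(minval)
    XMM_CLOBBERS_ONLY("xmm0")
);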

Profiling

Tools

# Record profile
perf record -g ffmpeg -i input.mp4 output.mp4

# View results
perf report

# Annotate assembly
perf annotate

FFmpeg Built-in Benchmarking

# Overall benchmark: total CPU time and peak memory at the end of the run
ffmpeg -benchmark -i input.mp4 -f null -

# Per-step timing (decode, encode, ...) during the run
ffmpeg -benchmark_all -i input.mp4 -f null -

Testing Optimizations

Correctness Testing

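1. Use fate tests

make fate-rsync SAMPLES=fate-suite/
make fate SAMPLES=fate-suite/

2. Compare output

# Reference: all runtime CPU optimizations disabled
ffmpeg -cpuflags 0 -i input.mp4 reference.yuv
# Optimized: default CPU feature detection
ffmpeg -i input.mp4 optimized.yuv
cmp reference.yuv optimized.yuv

3. Checksum validation

ffmpeg -i input.mp4 -f md5 -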
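
The same compare-against-reference idea applies at the function level. This is a minimal sketch that runs a reference and a candidate implementation on reproducible pseudo-random input and counts disagreements. All names here (`sad_ref`, `sad_candidate`, `lcg_next`, `count_mismatches`) are hypothetical; FFmpeg's own checkasm tool serves this purpose inside the source tree.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef int (*sad_fn)(const uint8_t *a, const uint8_t *b, size_t n);

/* Reference implementation. */
static int sad_ref(const uint8_t *a, const uint8_t *b, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += abs(a[i] - b[i]);
    return sum;
}

/* Candidate implementation: same result computed without abs(). */
static int sad_candidate(const uint8_t *a, const uint8_t *b, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    return sum;
}

/* Small deterministic generator so that failures are reproducible. */
static uint8_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return (uint8_t)(*state >> 24);
}

/* Number of trials in which the two functions disagree. */
static int count_mismatches(sad_fn ref, sad_fn opt, int trials)
{
    uint8_t a[64], b[64];
    uint32_t seed = 1;
    int bad = 0;

    for (int t = 0; t < trials; t++) {
        for (int i = 0; i < 64; i++) {
            a[i] = lcg_next(&seed);
            b[i] = lcg_next(&seed);
        }
        if (ref(a, b, 64) != opt(a, b, 64))
            bad++;
    }
    return bad;
}
```

A fixed seed keeps the test deterministic, so a failing input can be reproduced and inspected.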

Performance Testing

# Baseline: disable all runtime-detected CPU optimizations
ffmpeg -benchmark -cpuflags 0 -i input.mp4 -f null - 2>&1 | grep bench

# Optimized: default CPU feature detection
ffmpeg -benchmark -i input.mp4 -f null - 2>&1 | grep bench

Best Practices

Profile First

Always profile before optimizing to find real bottlenecks

Test Correctness

Verify optimized code produces identical output

Measure Impact

Quantify performance improvement objectively

Document Assumptions

Note alignment requirements and constraints

Handle Edge Cases

Ensure correct behavior for unusual inputs

Maintain Readability

Balance optimization with code maintainability

Additional Resources

Architecture Guide

Understanding FFmpeg’s structure

Multithreading

Parallelization strategies

x86 Optimization

Agner Fog’s optimization guides

Intel Intrinsics

SIMD intrinsics reference
