
Overview

git-filter-repo works by creating a processing pipeline that transforms Git history in a streaming fashion:
git fast-export <options> | filter | git fast-import <options>
This pipeline-based architecture is what makes git-filter-repo extremely fast and memory-efficient, even for large repositories.
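The three stages can be exercised directly. Below is a minimal sketch in Python, assuming `git` is on your PATH; the repo contents, names, and the byte-level `replace` filter are all invented for the demo — a naive stand-in for git-filter-repo's real stream parser:

```python
import pathlib, subprocess, tempfile

src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()

# Build a throwaway source repo with one commit.
subprocess.run(["git", "init", "-q", "-b", "main", src], check=True)
pathlib.Path(src, "hello.txt").write_text("Hello, World!\n")
subprocess.run(["git", "-C", src, "add", "hello.txt"], check=True)
subprocess.run(["git", "-C", src, "-c", "user.name=Jane", "-c",
                "user.email=jane@example.com", "commit", "-q", "-m", "Add greeting"],
               check=True)

# Stage 1: fast-export serializes all history to a byte stream.
export = subprocess.run(["git", "-C", src, "fast-export", "--all"],
                        capture_output=True, check=True).stdout

# Stage 2: the filter. A naive byte substitution stands in for
# git-filter-repo's structured parsing and rewriting.
filtered = export.replace(b" hello.txt", b" src/hello.txt")

# Stage 3: fast-import rebuilds a repository from the filtered stream.
subprocess.run(["git", "init", "-q", "-b", "main", dst], check=True)
subprocess.run(["git", "-C", dst, "fast-import", "--quiet"],
               input=filtered, capture_output=True, check=True)

files = subprocess.run(["git", "-C", dst, "ls-tree", "-r", "--name-only", "main"],
                       capture_output=True, text=True, check=True).stdout.split()
print(files)
```

The rewritten repository ends up with the file at `src/hello.txt`, even though no working directory was ever touched — everything happened at the stream level.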

The Three-Stage Pipeline

Stage 1: Fast-Export

The git fast-export command reads your Git repository and outputs a stream representation of all commits, trees, and blobs in a text-based format.
Fast-export produces a textual stream that includes:
  • Blob objects (file contents)
  • Commit objects (metadata, messages, parents)
  • File changes (modifications, deletions)
  • Tags and branch references
This format is designed to be:
  • Human-readable: You can inspect the stream to understand what’s being processed
  • Streamable: Objects can be processed one at a time without loading everything into memory
  • Complete: Contains all the information needed to reconstruct the entire repository history
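The streamability comes largely from the length-prefixed layout: a `data <n>` header states the exact byte length of the payload that follows, so a reader never needs lookahead or escaping rules. A toy sketch (not git-filter-repo's actual parser), using a hand-written slice of a stream:

```python
import io

# A hand-written slice of a fast-export stream for one blob.
stream = io.BytesIO(b"blob\nmark :1\ndata 14\nHello, World!\n")

kind = stream.readline().strip()            # b"blob"
mark = stream.readline().split()[1]         # b":1"
length = int(stream.readline().split()[1])  # 14, from "data 14"
content = stream.read(length)               # read exactly 14 bytes

print(kind, mark, length, content)
```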

Stage 2: Filter (git-filter-repo’s Core)

The middle stage is where git-filter-repo acts as an intelligent filter. It:
  1. Parses the fast-export stream into Python objects (Blob, Commit, FileChange, Tag, etc.)
  2. Processes each object according to your filtering rules
  3. Modifies objects as needed (renaming paths, changing commit messages, removing files)
  4. Outputs the modified stream in fast-import format
git-filter-repo creates objects like:
  • Blob: Represents file contents
  • Commit: Contains author, committer, message, file changes, and parents
  • FileChange: Tracks modifications, deletions, or additions to files
  • Tag: Annotated tag information
  • Reset: Branch creation or updates
These objects can be modified via callbacks or built-in filters before being written to the output stream.
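The callback shape can be sketched as follows. The `Commit` class here is a stand-in for git-filter-repo's own type, just to show the pattern: user code receives each parsed object and may mutate it before it is re-serialized.

```python
from dataclasses import dataclass, field

@dataclass
class Commit:  # stand-in for git-filter-repo's Commit object
    message: bytes
    file_changes: list = field(default_factory=list)

def commit_callback(commit):
    # The kind of edit user callback code might make: tag every message.
    commit.message = b"[imported] " + commit.message

c = Commit(message=b"Add greeting\n")
commit_callback(c)
print(c.message)
```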

Stage 3: Fast-Import

The git fast-import command reads the filtered stream and reconstructs the Git repository with the modified history.
Fast-import creates new Git objects with new SHA-1 hashes, since any modification to history necessarily changes the content being hashed.
This stage:
  • Creates new blob objects for files
  • Constructs new commit objects with updated references
  • Updates branch pointers and tags
  • Builds the new packfile with the rewritten history
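The new-hash point can be verified directly: Git's blob object id is the SHA-1 of a short header plus the content, so any content change necessarily yields a new id. A small sketch:

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    # Git hashes a blob as "blob <size>\0" followed by the raw content.
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()

old = git_blob_sha1(b"Hello, World!\n")
new = git_blob_sha1(b"Hello, Git!\n")
print(old != new)  # True: changing the content changes the object id
```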

Why This Architecture Is Fast

Memory Efficiency

Unlike git filter-branch, which checks out each commit to a working directory, git-filter-repo:
  • Streams data: Processes one object at a time
  • No working directory: Never materializes files to disk
  • Minimal overhead: Only the object currently being processed is held in memory
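In miniature, this streaming discipline looks like a generator pipeline: each stage yields one item at a time, so peak memory stays at a single object no matter how long the history is. A sketch (not filter-repo code; the filenames are invented):

```python
def parse(lines):
    # Yield one "object" at a time rather than building a full list.
    for line in lines:
        yield line

def filter_objects(objects):
    # Each object is transformed and handed downstream immediately.
    for obj in objects:
        yield obj.replace(b"hello.txt", b"src/hello.txt")

stream = [b"M 100644 :1 hello.txt", b"M 100644 :2 other.txt"]
out = list(filter_objects(parse(iter(stream))))
print(out)
```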

No Subprocess Overhead

While the pipeline uses git fast-export and git fast-import as subprocesses, git-filter-repo itself:
  • Processes in a single Python process: No spawning shells for each commit
  • Native string/bytes operations: Efficient text processing without external tools
  • Rich data structures: Uses Python’s dicts, lists, and objects instead of shell variables

Optimized for Git’s Architecture

Fast-export and fast-import are:
  • Native Git commands: Written in C, highly optimized
  • Purpose-built: Specifically designed for repository import/export
  • Actively maintained: Benefit from ongoing Git development
git-filter-repo has even driven improvements in Git itself! The tool’s author has contributed numerous enhancements to fast-export and fast-import based on filter-repo’s needs.

Additional Processing

Beyond the core pipeline, git-filter-repo also:
  1. Validates repository state: Ensures you’re working with a fresh clone (see Fresh Clone Requirements)
  2. Rewrites commit references: Updates commit IDs referenced in commit messages
  3. Prunes empty commits: Removes commits that become empty after filtering
  4. Handles topology changes: Manages merge commits when parents are pruned
  5. Cleans up cruft: Expires reflogs, repacks the repository, and removes old references
  6. Updates remotes: Removes the origin remote to prevent accidental pushes
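The cleanup in step 5 corresponds roughly to the following Git commands — an approximation of what git-filter-repo drives internally, demonstrated here against a throwaway repository:

```python
import pathlib, subprocess, tempfile

repo = tempfile.mkdtemp()
subprocess.run(["git", "init", "-q", repo], check=True)
pathlib.Path(repo, "f.txt").write_text("x\n")
subprocess.run(["git", "-C", repo, "add", "f.txt"], check=True)
subprocess.run(["git", "-C", repo, "-c", "user.name=Jane", "-c",
                "user.email=jane@example.com", "commit", "-q", "-m", "seed"],
               check=True)

# Drop every reflog entry, then repack and delete unreachable objects --
# roughly the "cleans up cruft" step after a history rewrite.
subprocess.run(["git", "-C", repo, "reflog", "expire", "--expire=now", "--all"],
               check=True)
subprocess.run(["git", "-C", repo, "gc", "--prune=now", "--quiet"], check=True)

head = subprocess.run(["git", "-C", repo, "rev-parse", "HEAD"],
                      capture_output=True, text=True, check=True).stdout.strip()
print(head)
```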

Extensibility

The architecture allows you to extend git-filter-repo in multiple ways:

Callbacks

Register Python functions to process each object type (blob, commit, tag, etc.)

Library Usage

Import git-filter-repo as a Python module to build custom history rewriting tools
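A sketch of the library pattern: the commented-out driver lines follow git-filter-repo's library API (module `git_filter_repo`, `FilteringOptions`, `RepoFilter`), but need an installed module and a fresh clone to actually run. The callback itself is plain Python, shown here against a stub object:

```python
def shorten_messages(commit, metadata):
    """Keep only the first line of each commit message."""
    commit.message = commit.message.split(b"\n", 1)[0] + b"\n"

# Driver, per git-filter-repo's library usage (requires the module and a repo):
# import git_filter_repo as fr
# args = fr.FilteringOptions.parse_args(["--force"])
# fr.RepoFilter(args, commit_callback=shorten_messages).run()

# The callback works on any object with a bytes `message` attribute:
class StubCommit:  # stand-in for git-filter-repo's Commit
    message = b"Add greeting\n\nLonger explanation...\n"

c = StubCommit()
shorten_messages(c, None)
print(c.message)
```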

Example: Following a Blob Through the Pipeline

Let’s trace how a file modification flows through the pipeline:
  1. Fast-export outputs:
    blob
    mark :1
    data 14
    Hello, World!
    
    commit refs/heads/main
    mark :2
    author John Doe <[email protected]> 1234567890 +0000
    committer John Doe <[email protected]> 1234567890 +0000
    data 13
    Add greeting
    M 100644 :1 hello.txt
    
  2. git-filter-repo processes:
    • Creates a Blob object with content “Hello, World!”
    • Creates a Commit object with a FileChange for hello.txt
    • If you’re filtering paths, it might rename hello.txt to src/hello.txt
    • Modifies the FileChange object to reflect the new path
  3. Fast-import receives:
    blob
    mark :1
    data 14
    Hello, World!
    
    commit refs/heads/main
    mark :2
    author John Doe <[email protected]> 1234567890 +0000
    committer John Doe <[email protected]> 1234567890 +0000
    data 13
    Add greeting
    M 100644 :1 src/hello.txt
    
  4. Result: A new repository with the file at src/hello.txt instead of hello.txt
Because history is being rewritten, all commit SHA-1 hashes will be different in the new repository. This is unavoidable when modifying history.

Performance Characteristics

Aspect              | git filter-branch        | git-filter-repo
Processing model    | Checkout each commit     | Stream processing
Speed               | Extremely slow           | Multiple orders of magnitude faster
Memory usage        | High (working directory) | Low (streaming)
Subprocess overhead | High (shells per commit) | Minimal (single Python process)
Scriptability       | Shell (OS-specific)      | Python (cross-platform)

Next Steps

Design Rationale

Learn why git-filter-repo was built and its 12 design goals

Fresh Clone Requirements

Understand why fresh clones are required for safety
