Overview
git-filter-repo works by creating a processing pipeline that transforms Git history in a streaming fashion.
The Three-Stage Pipeline
Stage 1: Fast-Export
The git fast-export command reads your Git repository and outputs a text-based stream representing every commit, its file changes, and the blobs they reference.
Fast-export produces a textual stream that includes:
- Blob objects (file contents)
- Commit objects (metadata, messages, parents)
- File changes (modifications, deletions)
- Tags and branch references
The stream format is:
- Human-readable: You can inspect the stream to understand what’s being processed
- Streamable: Objects can be processed one at a time without loading everything into memory
- Complete: Contains all the information needed to reconstruct the entire repository history
Stage 2: Filter (git-filter-repo’s Core)
The middle stage is where git-filter-repo acts as an intelligent filter. It:
- Parses the fast-export stream into Python objects (Blob, Commit, FileChange, Tag, etc.)
- Processes each object according to your filtering rules
- Modifies objects as needed (renaming paths, changing commit messages, removing files)
- Outputs the modified stream in fast-import format
How git-filter-repo processes the stream
git-filter-repo creates objects like:
- Blob: Represents file contents
- Commit: Contains author, committer, message, file changes, and parents
- FileChange: Tracks modifications, deletions, or additions to files
- Tag: Annotated tag information
- Reset: Branch creation or updates
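The parsing step can be illustrated with a minimal sketch. This is not git-filter-repo's actual parser: MiniBlob, MiniCommit, parse, and the sample stream are hypothetical names and data invented for illustration, and only blob and commit directives are handled. It assumes the stream layout that fast-export emits (a directive line, a mark, then a length-prefixed data section).

```python
import io
from dataclasses import dataclass, field


@dataclass
class MiniBlob:          # hypothetical stand-in, not git-filter-repo's Blob
    mark: int
    data: bytes


@dataclass
class MiniCommit:        # hypothetical stand-in, not git-filter-repo's Commit
    ref: bytes
    mark: int
    message: bytes
    file_lines: list = field(default_factory=list)  # raw "M ..."/"D ..." lines


def read_mark(stream):
    """Read a 'mark :<n>' line and return n."""
    return int(stream.readline()[len(b"mark :"):].strip())


def read_data(stream):
    """Read a 'data <n>' header, then exactly n bytes of payload."""
    header = stream.readline()
    if not header.startswith(b"data "):
        raise ValueError(f"expected data header, got {header!r}")
    return stream.read(int(header[len(b"data "):].strip()))


def parse(stream):
    """Yield one object at a time, so only the current object is held."""
    line = stream.readline()
    while line:
        if line == b"blob\n":
            yield MiniBlob(read_mark(stream), read_data(stream))
            line = stream.readline()
        elif line.startswith(b"commit "):
            ref = line[len(b"commit "):].strip()
            mark = read_mark(stream)
            stream.readline()  # author line (ignored in this sketch)
            stream.readline()  # committer line (ignored in this sketch)
            commit = MiniCommit(ref, mark, read_data(stream))
            line = stream.readline()
            while line.startswith((b"from ", b"merge ")):
                line = stream.readline()  # parent links (ignored here)
            while line.startswith((b"M ", b"D ")):
                commit.file_lines.append(line.rstrip(b"\n"))
                line = stream.readline()
            yield commit
        else:
            line = stream.readline()  # skip blank lines, unhandled directives


def data(payload):
    """Format a payload as the stream does: a byte count, then the bytes."""
    return b"data %d\n" % len(payload) + payload


# Hand-built sample stream with made-up identity and timestamp.
IDENT = b"A U Thor <author@example.com> 1700000000 +0000\n"
SAMPLE = (
    b"blob\nmark :1\n" + data(b"Hello, World!\n") + b"\n"
    + b"reset refs/heads/main\n"
    + b"commit refs/heads/main\nmark :2\n"
    + b"author " + IDENT + b"committer " + IDENT
    + data(b"Add hello.txt\n")
    + b"M 100644 :1 hello.txt\n\n"
)

objects = list(parse(io.BytesIO(SAMPLE)))
for obj in objects:
    print(type(obj).__name__, obj.mark)
```

Because parse is a generator, a caller can modify and re-emit each object before the next one is read, which is the property the streaming design relies on.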
Stage 3: Fast-Import
The git fast-import command reads the filtered stream and reconstructs the Git repository with the modified history.
Fast-import creates new Git objects with new SHA-1 hashes, since any modification to history necessarily changes the content being hashed. Specifically, it:
- Creates new blob objects for files
- Constructs new commit objects with updated references
- Updates branch pointers and tags
- Builds the new packfile with the rewritten history
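Why every rewritten object gets a new hash can be shown with a short sketch. It assumes a SHA-1 repository and uses Git's object naming scheme (the hash of a type-and-size header followed by the object body). The helper names, identity, and timestamp are made up for illustration, and the rename stays within the top-level directory so that a single tree object is enough.

```python
import hashlib


def git_object_id(kind: bytes, body: bytes) -> str:
    """Return the SHA-1 Git assigns: sha1(b'<kind> <size>' + NUL + body)."""
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def one_file_tree(path: bytes, blob_id: str) -> bytes:
    """Body of a tree holding a single regular file (mode 100644)."""
    return b"100644 " + path + b"\0" + bytes.fromhex(blob_id)


def commit_body(tree_id: str, message: bytes) -> bytes:
    """Body of a parentless commit with a fixed, made-up identity."""
    ident = b"A U Thor <author@example.com> 1700000000 +0000"
    return b"\n".join([
        b"tree " + tree_id.encode(),
        b"author " + ident,
        b"committer " + ident,
        b"",
        message,
    ])


blob_id = git_object_id(b"blob", b"Hello, World!\n")

# Same blob and same message; only the path differs between the histories.
old_tree = git_object_id(b"tree", one_file_tree(b"hello.txt", blob_id))
new_tree = git_object_id(b"tree", one_file_tree(b"renamed.txt", blob_id))
old_commit = git_object_id(b"commit", commit_body(old_tree, b"Add hello\n"))
new_commit = git_object_id(b"commit", commit_body(new_tree, b"Add hello\n"))

print("tree changed:", old_tree != new_tree)
print("commit changed:", old_commit != new_commit)
```

The blob keeps its ID because its content is untouched, but the tree names the path and the commit names the tree, so the change propagates upward. A child commit would embed its parent's ID in the same way, which is why a rewrite affects every descendant.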
Why This Architecture Is Fast
Memory Efficiency
Unlike git filter-branch, which checks out each commit to a working directory, git-filter-repo:
- Streams data: Processes one object at a time
- No working directory: Never materializes files to disk
- Minimal overhead: Only holds current object being processed in memory
No Subprocess Overhead
While the pipeline uses git fast-export and git fast-import as subprocesses, git-filter-repo itself:
- Processes in a single Python process: No spawning shells for each commit
- Native string/bytes operations: Efficient text processing without external tools
- Rich data structures: Uses Python’s dicts, lists, and objects instead of shell variables
Optimized for Git’s Architecture
Fast-export and fast-import are:
- Native Git commands: Written in C, highly optimized
- Purpose-built: Specifically designed for repository import/export
- Actively maintained: Benefit from ongoing Git development
git-filter-repo has even driven improvements in Git itself! The tool’s author has contributed numerous enhancements to fast-export and fast-import based on filter-repo’s needs.
Additional Processing
Beyond the core pipeline, git-filter-repo also:
- Validates repository state: Ensures you’re working with a fresh clone (see Fresh Clone Requirements)
- Rewrites commit references: Updates commit IDs referenced in commit messages
- Prunes empty commits: Removes commits that become empty after filtering
- Handles topology changes: Manages merge commits when parents are pruned
- Cleans up cruft: Expires reflogs, repacks the repository, and removes old references
- Updates remotes: Removes the origin remote to prevent accidental pushes
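The pruning and topology steps in the list above can be sketched for the simplest case, a linear history. This is an illustration of the idea, not git-filter-repo's algorithm: MiniCommit and filter_history are hypothetical names, merge commits are not handled, and the original labels are kept for readability even though real rewritten commits receive new IDs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MiniCommit:            # hypothetical stand-in for illustration
    id: str
    parent: Optional[str]    # linear history: at most one parent
    paths: List[str]         # paths touched by this commit


def filter_history(commits: List[MiniCommit],
                   keep: Callable[[str], bool]) -> List[MiniCommit]:
    """Drop unwanted paths, prune commits left empty, and reparent the rest.

    `commits` must be ordered parents-first, as a fast-export stream is.
    """
    # Maps each original id to its nearest surviving ancestor-or-self.
    remap = {}
    result = []
    for commit in commits:
        kept = [p for p in commit.paths if keep(p)]
        new_parent = remap.get(commit.parent) if commit.parent else None
        if not kept:
            # Pruned: later commits that pointed here attach to our parent.
            remap[commit.id] = new_parent
            continue
        result.append(MiniCommit(commit.id, new_parent, kept))
        remap[commit.id] = commit.id
    return result


history = [
    MiniCommit("A", None, ["src/a.py"]),
    MiniCommit("B", "A", ["docs/guide.md"]),
    MiniCommit("C", "B", ["src/b.py", "docs/api.md"]),
]

for commit in filter_history(history, lambda p: p.startswith("src/")):
    print(commit.id, commit.parent, commit.paths)
```

Commit B touches only docs/, so it becomes empty and is dropped, and C is reattached to A. The same old-to-new mapping is the kind of bookkeeping needed to update commit IDs mentioned in commit messages.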
Extensibility
The architecture allows you to extend git-filter-repo in multiple ways:
- Callbacks: Register Python functions to process each object type (blob, commit, tag, etc.)
- Library Usage: Import git-filter-repo as a Python module to build custom history rewriting tools
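A commit callback combined with library usage might look like the sketch below. It assumes git-filter-repo is importable as the module git_filter_repo and that FilteringOptions.default_options() and RepoFilter(args, commit_callback=...) behave as in recent releases; check both against your installed version. The redact_message and rewrite_current_repo functions and the string being redacted are hypothetical.

```python
def redact_message(commit, metadata=None):
    """Commit callback: replace a sensitive word in the commit message.

    git-filter-repo exposes commit messages as bytes, so the
    replacement uses bytes literals.
    """
    commit.message = commit.message.replace(b"hunter2", b"[redacted]")


def rewrite_current_repo():
    """Run the callback over the repository in the current directory.

    Assumption: the module name and these two calls match the installed
    git-filter-repo version. This rewrites history, so run it only in
    a fresh clone.
    """
    import git_filter_repo as fr  # imported lazily; not needed to test the callback

    args = fr.FilteringOptions.default_options()
    fr.RepoFilter(args, commit_callback=redact_message).run()
```

Keeping the callback as a plain function means it can be exercised against a stand-in object before being pointed at a real repository; calling rewrite_current_repo() is the step that actually rewrites history.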
Example: Following a Blob Through the Pipeline
Let’s trace how a file modification flows through the pipeline:
- Fast-export outputs the blob for hello.txt and the commit that modifies it.
- git-filter-repo processes:
  - Creates a Blob object with content “Hello, World!”
  - Creates a Commit object with a FileChange for hello.txt
  - If you’re filtering paths, might rename hello.txt to src/hello.txt
  - Modifies the FileChange object to reflect the new path
- Fast-import receives the modified stream.
- Result: A new repository with the file at src/hello.txt instead of hello.txt
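The rename step of this trace can be sketched at the level of a single file-change line. MiniFileChange and move_into are hypothetical names rather than git-filter-repo's classes. The line layout (M, mode, blob mark, path) follows the file-modify command of the fast-import stream format, and the sketch assumes a path that needs no quoting.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MiniFileChange:        # hypothetical stand-in for illustration
    kind: bytes              # b"M" for a modification
    mode: bytes              # b"100644" for a regular file
    blob_mark: int           # mark of the blob holding the contents
    path: bytes

    def to_stream(self) -> bytes:
        """Serialize as a file-modify line: M <mode> :<mark> <path>."""
        return b"%s %s :%d %s\n" % (
            self.kind, self.mode, self.blob_mark, self.path)


def move_into(change: MiniFileChange, directory: bytes) -> MiniFileChange:
    """Return a copy of the change with its path moved under `directory`."""
    return replace(change, path=directory + b"/" + change.path)


before = MiniFileChange(b"M", b"100644", 1, b"hello.txt")
after = move_into(before, b"src")

print(before.to_stream().decode(), end="")  # line as received from fast-export
print(after.to_stream().decode(), end="")   # line as sent to fast-import
```

Only the path changes; the blob mark still points at the same contents, so the blob itself passes through the filter untouched.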
Performance Characteristics
| Aspect | git filter-branch | git-filter-repo |
|---|---|---|
| Processing model | Checkout each commit | Stream processing |
| Speed | Extremely slow | Multiple orders of magnitude faster |
| Memory usage | High (working directory) | Low (streaming) |
| Subprocess overhead | High (shells per commit) | Minimal (single Python process) |
| Scriptability | Shell (OS-specific) | Python (cross-platform) |
Next Steps
- Design Rationale: Learn why git-filter-repo was built and its 12 design goals
- Fresh Clone Requirements: Understand why fresh clones are required for safety
