Before rewriting repository history, it’s essential to understand what’s in your repository. git-filter-repo includes a powerful analysis mode that generates detailed reports.
Basic Analysis
Generate a comprehensive analysis report:
Clone the repository
git clone https://github.com/example/repo.git
cd repo
Run analysis
git filter-repo --analyze
This creates reports in the .git/filter-repo/analysis/ directory without modifying your repository.
Review the reports
ls -la .git/filter-repo/analysis/
Generated reports:
blob-shas-and-paths.txt - All blob hashes and their paths
directories-all-sizes.txt - Directory sizes across all history
directories-deleted-sizes.txt - Size of deleted directories
extensions-all-sizes.txt - File types and their sizes
extensions-deleted-sizes.txt - Deleted files by extension
path-all-sizes.txt - Individual file sizes
path-deleted-sizes.txt - Deleted files
renames.txt - File rename history
The analysis mode does not modify your repository. It’s safe to run multiple times.
Understanding the Reports
Largest Files Report
View the biggest files in history:
head -20 .git/filter-repo/analysis/path-all-sizes.txt
Output format:
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
500000000 245000000 <present> large-file.bin
250000000 125000000 <present> data/dataset.csv
100000000 50000000 2024-01-15 old/backup.zip
This shows:
- Unpacked size: Actual file size
- Packed size: Compressed size in Git
- Date deleted: When file was removed (or <present> if still exists)
- Path name: File location
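For quick triage, the report can be filtered with standard tools. A minimal sketch, where `large_paths` and the thresholds are illustrative helpers, not part of git-filter-repo:

```shell
# large_paths: print entries from a path-all-sizes.txt style report whose
# unpacked size (column 1) exceeds a byte threshold.
# The report's first two lines are headers, so NR > 2 skips them.
# $NF is the path (this assumes path names without embedded spaces).
large_paths() {
    awk -v min="$2" 'NR > 2 && $1 + 0 > min {print $1, $NF}' "$1"
}

# Usage (100 MiB threshold):
# large_paths .git/filter-repo/analysis/path-all-sizes.txt $((100 * 1024 * 1024))
```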
Deleted Files Report
Find large files that were deleted but still bloat history:
head -20 .git/filter-repo/analysis/path-deleted-sizes.txt
These are prime candidates for removal since they:
- No longer exist in the current codebase
- Still take up space in clones
- Slow down clone operations
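To estimate how much a cleanup might reclaim, the deleted-paths report can be totaled. A rough sketch (`deleted_total` is an illustrative helper); it sums packed sizes, which is what actually affects clone size:

```shell
# deleted_total: sum the packed sizes (column 2) of every entry in a
# path-deleted-sizes.txt style report, skipping the two header lines.
# The result approximates the bytes a history rewrite could reclaim.
deleted_total() {
    awk 'NR > 2 {sum += $2} END {print sum + 0}' "$1"
}

# Usage:
# deleted_total .git/filter-repo/analysis/path-deleted-sizes.txt
```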
Extension Analysis
See which file types consume the most space:
cat .git/filter-repo/analysis/extensions-all-sizes.txt
Example output:
=== All extensions by reverse accumulated size ===
1.2 GiB 600 MiB .bin
800 MiB 400 MiB .mp4
500 MiB 250 MiB .zip
100 MiB 50 MiB .js
This helps identify:
- Binary files that shouldn’t be in Git
- Large media files better suited for Git LFS
- Compressed files that could be removed
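The sizes in this report are human-readable, which makes them awkward to sort or threshold numerically. A small sketch for converting them back to bytes (`to_bytes` is an illustrative helper; it assumes the binary, 1024-based units shown in the example output above):

```shell
# to_bytes: convert a human-readable size like "600 MiB" or "1.2 GiB"
# into a raw byte count, using binary (1024-based) units.
to_bytes() {
    awk -v s="$1" 'BEGIN {
        split(s, a, " ")
        m["B"] = 1; m["KiB"] = 1024; m["MiB"] = 1024 ^ 2
        m["GiB"] = 1024 ^ 3; m["TiB"] = 1024 ^ 4
        printf "%.0f\n", a[1] * m[a[2]]
    }'
}

# Usage:
# to_bytes "600 MiB"   # -> 629145600
```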
Directory Size Report
head -20 .git/filter-repo/analysis/directories-all-sizes.txt
Identifies which directories contribute most to repository size.
Acting on Analysis Results
Once you’ve analyzed your repository, you can clean it up:
Remove large deleted files
Identify files to remove
Review deleted files:
head -20 .git/filter-repo/analysis/path-deleted-sizes.txt
Create removal list
Create ../files-to-remove.txt with paths of files to delete:
old/large-backup.zip
data/old-dataset.csv
build/artifacts.tar.gz
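Before running the rewrite, it can be worth checking that each listed path actually appears in the analysis output, so a typo doesn't silently match nothing. A minimal sketch (`check_list` is an illustrative helper):

```shell
# check_list: report any path in a removal list that never appears in the
# given analysis report -- usually a sign of a typo in the list.
check_list() {
    local list="$1" report="$2" p
    while IFS= read -r p; do
        grep -qF "$p" "$report" || echo "not found in report: $p"
    done < "$list"
}

# Usage:
# check_list ../files-to-remove.txt .git/filter-repo/analysis/path-deleted-sizes.txt
```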
Remove the files
git filter-repo --invert-paths --paths-from-file ../files-to-remove.txt
Remove files by extension
Remove all files of a certain type:
# Remove all .zip files
git filter-repo --invert-paths --path-glob '*.zip'
# Remove all binaries
git filter-repo --invert-paths --path-glob '*.bin' --path-glob '*.exe'
Remove large directories
# Remove a bloated directory
git filter-repo --invert-paths --path node_modules/
Advanced Analysis Techniques
Find duplicate files
The blob-shas-and-paths.txt report shows all blob hashes. Look for identical hashes:
# Find blobs with multiple paths (potential duplicates)
awk '{print $1}' .git/filter-repo/analysis/blob-shas-and-paths.txt | \
sort | uniq -c | sort -rn | head -20
Track file renames
The renames.txt report shows file rename history:
cat .git/filter-repo/analysis/renames.txt
Example output:
old-name.js -> new-name.js
src/component.tsx -> lib/ui/component.tsx
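If you need to follow one file across several renames, the report can be walked entry by entry. A sketch (`trace_rename` is an illustrative helper; it assumes the `old -> new` line format shown above and no rename cycles):

```shell
# trace_rename: starting from a path, follow successive "old -> new"
# entries in a renames.txt style report and print each name in the chain.
trace_rename() {
    local report="$1" name="$2" next
    echo "$name"
    while next=$(awk -v n="$name" -F' -> ' '$1 == n {print $2; exit}' "$report") &&
          [ -n "$next" ]; do
        echo "$next"
        name="$next"
    done
}

# Usage:
# trace_rename .git/filter-repo/analysis/renames.txt old-name.js
```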
Analyze by time period
Combine analysis with Git log to find when large files were added:
# Find when large files were added (skip the two report header lines)
tail -n +3 .git/filter-repo/analysis/path-deleted-sizes.txt | head -10 | \
    awk '{print $4}' | while read -r file; do
    echo "=== $file ==="
    git log --all --full-history --format="%ai %an" -- "$file" | tail -1
done
Continuous Monitoring
Pre-commit checks
Prevent large files from being committed:
#!/bin/bash
# .git/hooks/pre-commit
MAX_SIZE=10485760  # 10 MB
for file in $(git diff --cached --name-only --diff-filter=ACM); do
  if [ -f "$file" ]; then
    size=$(stat -f%z "$file" 2>/dev/null || stat -c%s "$file" 2>/dev/null)
    if [ "$size" -gt "$MAX_SIZE" ]; then
      echo "Error: $file is larger than 10MB"
      exit 1
    fi
  fi
done
Make the hook executable with chmod +x .git/hooks/pre-commit.
Regular analysis
Run analysis periodically to catch bloat early:
#!/bin/bash
# analyze-repo.sh
git clone --bare https://github.com/example/repo.git repo-analysis
cd repo-analysis
git filter-repo --analyze
head -20 filter-repo/analysis/path-all-sizes.txt  # bare clone: reports live under the repo dir, not .git/
Interpreting Size Differences
Unpacked vs Packed Size
- Unpacked: Actual file size on disk
- Packed: Compressed size in Git objects
- Large difference indicates good compression
- Small difference suggests already-compressed files (zip, jpg, etc.)
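The ratio itself is easy to compute from the largest-files report. A sketch (`pack_ratio` is an illustrative helper); ratios near 1.0 flag content that is already compressed:

```shell
# pack_ratio: print the packed/unpacked ratio and path for each entry in a
# path-all-sizes.txt style report (the first two lines are headers).
pack_ratio() {
    awk 'NR > 2 && $1 > 0 {printf "%.2f %s\n", $2 / $1, $NF}' "$1"
}

# Usage:
# pack_ratio .git/filter-repo/analysis/path-all-sizes.txt | sort -rn | head
```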
Current vs Historical Size
Compare current repository size to historical:
# Current repository size
du -sh .
# Total historical (unpacked) size, summed from the analysis report
awk 'NR > 2 {sum += $1} END {print sum}' .git/filter-repo/analysis/path-all-sizes.txt
Large discrepancy indicates deleted files still bloating history.
Example: Complete Repository Audit
Run analysis
git clone https://github.com/example/repo.git
cd repo
git filter-repo --analyze
Generate summary report
echo "=== TOP 10 LARGEST FILES ==="
head -11 .git/filter-repo/analysis/path-all-sizes.txt
echo ""
echo "=== TOP 10 DELETED FILES ==="
head -11 .git/filter-repo/analysis/path-deleted-sizes.txt
echo ""
echo "=== SIZE BY EXTENSION ==="
head -11 .git/filter-repo/analysis/extensions-all-sizes.txt
echo ""
echo "=== LARGEST DIRECTORIES ==="
head -11 .git/filter-repo/analysis/directories-all-sizes.txt
Identify cleanup targets
Based on the reports:
- Large deleted files: Remove from history
- Binary files: Consider Git LFS
- Build artifacts: Should be in .gitignore
- Old dependencies: Safe to remove
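To keep those artifacts from re-entering history after the cleanup, matching patterns can be added to .gitignore. The patterns below are illustrative; base yours on your own reports:

```gitignore
# Illustrative ignore patterns for common cleanup targets
*.zip
*.bin
node_modules/
build/
```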
Plan cleanup
Create a cleanup plan:
1. Remove all .zip files (build artifacts)
2. Remove data/*.bin (old datasets)
3. Remove node_modules/ from history
4. Move large media files to Git LFS
Best Practices
- Run analysis before filtering: Always analyze before making changes
- Keep reports: Save analysis reports for historical reference
- Focus on deleted files: Biggest wins come from removing deleted large files
- Check extensions: Identify file types that shouldn’t be in Git
- Compare before/after: Run analysis again after cleanup to verify
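The last point can be scripted by saving a report before filtering and diffing it against the regenerated one afterwards. A sketch (`compare_reports` is an illustrative helper):

```shell
# compare_reports: diff the top 20 entries of two analysis reports, e.g. a
# path-all-sizes.txt saved before cleanup against the one generated after.
compare_reports() {
    t1=$(mktemp) t2=$(mktemp)
    head -20 "$1" > "$t1"
    head -20 "$2" > "$t2"
    if diff "$t1" "$t2"; then
        echo "no change in top entries"
    fi
    rm -f "$t1" "$t2"
}

# Usage:
# cp -r .git/filter-repo/analysis /tmp/analysis-before   # before filtering
# compare_reports /tmp/analysis-before/path-all-sizes.txt \
#                 .git/filter-repo/analysis/path-all-sizes.txt
```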
Troubleshooting Analysis
Analysis is slow
For very large repositories:
# Analyze only recent history
git filter-repo --analyze --refs HEAD~1000..HEAD
# Analyze specific branches
git filter-repo --analyze --refs main develop
Reports are too large
Filter reports to focus on significant files:
# Only entries with unpacked size over 1 MB (skip the two header lines)
awk 'NR > 2 && $1 > 1048576' .git/filter-repo/analysis/path-all-sizes.txt
Missing expected files
Ensure you’re analyzing all branches:
# Analysis covers all refs (branches and tags) by default
git filter-repo --analyze
# If an earlier run restricted the analysis with --refs, rerun without it