Overview
RAPTOR’s OSS Forensics system provides evidence-backed investigation of public GitHub repositories through automated evidence collection from multiple tamper-proof sources.
Multi-Source Evidence
Collect from GH Archive, GitHub API, Wayback Machine, and local git repositories
Hypothesis Formation
AI-powered analysis that forms and validates hypotheses based on collected evidence
Verification Pipeline
All evidence is verified against original sources before inclusion in reports
Timeline Reconstruction
Build complete incident timelines with actor attribution and impact assessment
Command Usage
Command Flags
| Flag | Default | Description |
|---|---|---|
| --max-followups | 3 | Maximum evidence collection rounds |
| --max-retries | 3 | Maximum hypothesis revision rounds |
Evidence Sources
1. GitHub Archive (GH Archive)
Tamper-proof event data via BigQuery
GitHub Archive provides immutable forensic evidence of all public GitHub events since 2011. This is your ground truth for:
- Actor attribution: Who performed what actions
- Timeline reconstruction: When events occurred (UTC timestamps)
- Deleted content recovery: Issues, PRs, tags, and branches persist in archive
- Automation vs API abuse detection: Presence/absence of WorkflowRunEvent indicates legitimate workflow vs direct API attack
2. GitHub API
Live repository data
- Current commit content and metadata
- File contents at specific refs
- Branch and tag information
- PR and issue details (if not deleted)
- Fork relationships
Typical uses:
- Retrieve commit content after getting SHA from GH Archive
- Verify current repository state
- Cross-reference with archived data
3. Wayback Machine
Recover deleted web content
- Deleted README files and documentation
- Issue/PR descriptions and comments (Archive Team prioritizes these)
- Repository metadata snapshots
- Wiki pages
- Release notes
Commonly captured in snapshots:
- README files and repository descriptions
- Issue titles, bodies, and comments
- PR conversations (Files Changed tab often unavailable)
- Commit SHAs from archived commit list pages
Not available from the Wayback Machine:
- Private repository content
- Complete git history or clones
- Content behind authentication
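To check whether a deleted page was ever captured, one option is to list its snapshots through the Wayback Machine's public CDX endpoint. The sketch below only builds the request URL; the helper name `build_cdx_query` and the example repository path are hypothetical and are not part of RAPTOR.

```python
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(page_url: str, start: str, end: str, limit: int = 50) -> str:
    """Build a CDX URL listing snapshots of page_url.

    start and end are timestamps in YYYYMMDD form (UTC).
    Hypothetical helper for illustration only.
    """
    params = {
        "url": page_url,
        "from": start,
        "to": end,
        "output": "json",             # JSON rows instead of plain text
        "filter": "statuscode:200",   # keep only successful captures
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical target: an issue page on a placeholder repository.
    print(build_cdx_query("github.com/example-org/example-repo/issues/1",
                          "20250601", "20250630"))
```

Fetching the resulting URL with any HTTP client returns one row per capture; each row's timestamp can then be used to open the archived page.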
4. Local Git Repositories
Clone and analyze dangling commits
Investigation Workflow
The orchestrator coordinates 17 specialized agents through a structured workflow:
Phase 0: Initialize Investigation
Creates a timestamped output directory and initializes an empty evidence store.
Phase 1: Parse Research Question
Extracts investigation targets:
- Repository references (owner/repo)
- Actor usernames
- Date ranges
- Vendor report URLs
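The extraction step could look roughly like the following. This is a minimal sketch rather than RAPTOR's actual parser: it assumes actors are written as `@mentions`, dates are either ISO (`YYYY-MM-DD`) or English long form (`July 13, 2025`), and every URL counts as a report URL. The `Targets` class and `parse_question` function are hypothetical names.

```python
import re
from dataclasses import dataclass, field
from datetime import date, datetime

URL_RE = re.compile(r"https?://[^\s\"'<>]+")
# owner/repo, not preceded by a path, dot, or word character.
REPO_RE = re.compile(r"(?<![\w/.-])([A-Za-z0-9][A-Za-z0-9-]*)/([A-Za-z0-9._-]+)(?![\w/])")
MENTION_RE = re.compile(r"(?<![\w/])@([A-Za-z0-9][A-Za-z0-9-]*)")
ISO_DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
LONG_DATE_RE = re.compile(r"\b([A-Z][a-z]+) (\d{1,2}), (\d{4})\b")


@dataclass
class Targets:
    repos: list = field(default_factory=list)
    actors: list = field(default_factory=list)
    dates: list = field(default_factory=list)
    urls: list = field(default_factory=list)


def parse_question(question: str) -> Targets:
    """Pull repositories, actors, dates, and URLs out of a free-text question."""
    targets = Targets()
    targets.urls = URL_RE.findall(question)
    # Remove URLs first so their path segments are not mistaken for owner/repo.
    text = URL_RE.sub(" ", question)
    targets.repos = [f"{owner}/{repo.rstrip('.')}" for owner, repo in REPO_RE.findall(text)]
    targets.actors = MENTION_RE.findall(text)
    for y, m, d in ISO_DATE_RE.findall(text):
        try:
            targets.dates.append(date(int(y), int(m), int(d)))
        except ValueError:
            continue  # e.g. month 13; skip rather than guess
    for month, day, year in LONG_DATE_RE.findall(text):
        try:
            parsed = datetime.strptime(f"{month} {day} {year}", "%B %d %Y")
        except ValueError:
            continue  # capitalized word that is not a month name
        targets.dates.append(parsed.date())
    return targets
```

Note that `%B` depends on the process locale, so the long-form date branch assumes English month names.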
Phase 2: Parallel Evidence Collection
Spawns investigators in parallel for efficiency. Collected evidence is written to the evidence.json store.
Phase 3: Hypothesis Formation Loop
Phase 4: Evidence Verification
All collected evidence is re-verified against original sources:
- GH Archive events re-queried from BigQuery
- GitHub API observations re-fetched
- Wayback snapshots re-checked
- Local git commits re-validated
Phase 5: Hypothesis Validation Loop
- Every claim must cite evidence by ID
- Attribution must have HIGH confidence with multiple sources
- Timeline must have exact UTC timestamps
- Impact assessment must be evidence-backed
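The first rule above (every claim cites evidence by ID) lends itself to a mechanical check. The sketch below assumes a hypothetical layout: claims are dicts with `text` and `evidence_ids`, and the evidence store is a dict with an `evidence` list whose items carry `id` and `verified_at`. The real evidence.json schema may differ.

```python
def find_citation_problems(claims, evidence_store):
    """Return (claim_text, problem) pairs for claims with missing or weak citations.

    The claim and store layouts are assumptions made for illustration.
    """
    known = {item["id"]: item for item in evidence_store.get("evidence", [])}
    problems = []
    for claim in claims:
        cited = claim.get("evidence_ids", [])
        if not cited:
            problems.append((claim["text"], "no evidence cited"))
            continue
        for evidence_id in cited:
            if evidence_id not in known:
                problems.append((claim["text"], f"unknown evidence ID {evidence_id}"))
            elif not known[evidence_id].get("verified_at"):
                problems.append((claim["text"], f"evidence {evidence_id} not verified"))
    return problems
```

An empty result means every claim cites at least one ID and every cited ID resolves to a verified item; anything else would send the hypothesis back for revision.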
Phase 6: Generate Report
Produces the final forensic report (see Forensic Report Format).
Real-World Investigation Patterns
Deleted PR Recovery
Scenario: Media claims an attacker submitted a malicious PR in “late June”, but the PR is now deleted.
Investigation:
- Query GH Archive for PR events in the claimed timeframe
- Outcome:
  - If events found: Claim verified → PR existed, recover details from archive
  - If no events: Claim disproven → No PR activity in claimed timeframe
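A query for this pattern might be built as follows. This is a sketch under stated assumptions: it targets the public `githubarchive.day.*` tables, selects only the columns needed, and pulls `action` and `number` from the JSON `payload` string. The function name is hypothetical, and the SQL is only assembled here, not executed.

```python
import re
from datetime import date


def build_pr_events_query(repo: str, start: date, end: date) -> str:
    """Build a BigQuery SQL string listing PullRequestEvents for repo in [start, end].

    Hypothetical helper; dates are interpreted as UTC days in this century.
    """
    # Strict validation because the name is interpolated into SQL below.
    if not re.fullmatch(r"[A-Za-z0-9-]+/[A-Za-z0-9._-]+", repo):
        raise ValueError(f"expected owner/repo, got {repo!r}")
    if end < start:
        raise ValueError("end date is before start date")
    # With the wildcard `day.20*`, _TABLE_SUFFIX is the remaining YYMMDD part.
    return f"""
SELECT
  created_at,
  actor.login AS actor,
  JSON_EXTRACT_SCALAR(payload, '$.action') AS action,
  JSON_EXTRACT_SCALAR(payload, '$.number') AS pr_number
FROM `githubarchive.day.20*`
WHERE _TABLE_SUFFIX BETWEEN '{start:%y%m%d}' AND '{end:%y%m%d}'
  AND type = 'PullRequestEvent'
  AND repo.name = '{repo}'
ORDER BY created_at
""".strip()


if __name__ == "__main__":
    # "Late June" is vague, so the window is deliberately generous.
    print(build_pr_events_query("example-org/example-repo",
                                date(2025, 6, 15), date(2025, 7, 5)))
```

Bounding the table suffix keeps the scan limited to the claimed timeframe, which matters for cost. In production code, query parameters would be a safer choice than string interpolation.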
Force Push Detection
Scenario: Suspicious commits appear, then disappear from branch history.
Detection: Zero-commit PushEvents indicate force pushes. The before SHA of such an event points to the “deleted” commit, which remains accessible on GitHub.
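Once PushEvents have been exported from GH Archive, the zero-commit filter is a few lines of code. The sketch below assumes each event is a dict whose `payload` has already been parsed from JSON and carries `size`, `ref`, `before`, and `head` fields, as PushEvent payloads are understood to do. The function name is hypothetical.

```python
def find_force_push_candidates(events):
    """Return zero-commit PushEvents, which are force-push candidates.

    Each result keeps the `before` SHA, i.e. the commit that was replaced.
    """
    candidates = []
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        payload = event.get("payload") or {}
        commits = payload.get("commits") or []
        # Fall back to counting commits if `size` is missing.
        size = int(payload.get("size", len(commits)))
        if size != 0:
            continue
        candidates.append({
            "created_at": event.get("created_at"),
            "actor": (event.get("actor") or {}).get("login"),
            "ref": payload.get("ref"),
            "before": payload.get("before"),  # the "deleted" commit
            "head": payload.get("head"),      # where the branch points now
        })
    return candidates
```

The results are candidates rather than proof: each `before` SHA still needs to be fetched and inspected before it is treated as evidence.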
Automation vs Direct API Attack
Scenario: A commit appears under an automation account. Was it a legitimate workflow or a compromised token?
Detection:
- Legitimate workflow: WorkflowRunEvent shortly before/after PushEvent
- Direct API attack: PushEvent with NO WorkflowRunEvent in ±10 minute window
Real Example: Amazon Q investigation proved direct API attack when ZERO workflow events appeared during malicious commit window, despite 18 workflows that day in different time windows.
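The ±10 minute rule can be expressed as a small filter over exported events. This is a sketch, not RAPTOR's implementation: it assumes events are dicts with a `type` and an ISO 8601 UTC `created_at` string, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def _parse_utc(timestamp: str) -> datetime:
    # Accept the trailing "Z" form on Python versions before 3.11 as well.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def pushes_without_workflow(events, window=timedelta(minutes=10)):
    """Return PushEvents with no WorkflowRunEvent within +/- window."""
    workflow_times = [
        _parse_utc(e["created_at"]) for e in events if e["type"] == "WorkflowRunEvent"
    ]
    suspicious = []
    for event in events:
        if event["type"] != "PushEvent":
            continue
        pushed_at = _parse_utc(event["created_at"])
        # A push is considered covered if any workflow run is close enough in time.
        if not any(abs(pushed_at - t) <= window for t in workflow_times):
            suspicious.append(event)
    return suspicious
```

Events should be pre-filtered to a single repository and actor before calling this, since a workflow run in an unrelated repository would otherwise mask a suspicious push.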
Cost Management for BigQuery
GitHub Archive queries cost $6.25 per TiB of data scanned.
Optimization Strategies
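As a rough illustration of what the $6.25 per TiB rate means for a single query, the estimate is a one-line calculation. The function name is hypothetical; the byte count would typically come from a BigQuery dry run, which reports bytes to be processed without running the query.

```python
PRICE_PER_TIB_USD = 6.25   # rate quoted above
BYTES_PER_TIB = 2 ** 40


def estimate_query_cost(bytes_scanned: int) -> float:
    """Estimate the on-demand cost in USD for a query scanning bytes_scanned bytes."""
    return bytes_scanned / BYTES_PER_TIB * PRICE_PER_TIB_USD


if __name__ == "__main__":
    half_tib = 2 ** 39
    print(f"${estimate_query_cost(half_tib):.3f}")  # half a TiB -> $3.125
```

Because cost scales linearly with bytes scanned, any reduction in the columns or date range read translates directly into the same percentage saving.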
1. Select Only Required Columns (50-90% cost reduction)
Output and Evidence Store
All results are saved to .out/oss-forensics-{timestamp}/
evidence.json Structure
Forensic Report Format
The final report includes:
- Executive Summary: High-level findings
- Timeline: Chronological event sequence with evidence citations
- Attribution: Actor identification with confidence levels
- Intent Analysis: Evidence-based reasoning about attacker goals
- Impact Assessment: Affected systems and scope
- IOCs: Extractable indicators for detection rules
- Evidence Appendix: Full evidence details with verification status
Best Practices
Start with Specific Questions
Good: “Did lkmanka58 create any tags on aws/aws-toolkit-vscode on July 13, 2025?”
Bad: “Investigate aws/aws-toolkit-vscode”
Specific questions lead to targeted evidence collection and faster results.
Trust the Evidence Store
All evidence is verified against original sources. Don’t second-guess the verification pipeline: if evidence is in the store with a verified_at timestamp, it has been re-checked.
Use Incremental Investigation
Let the hypothesis former request additional evidence rather than collecting everything upfront. This saves time and BigQuery costs.
Cite Everything
Every claim in the final report must cite evidence by ID. No speculation, only evidence-backed conclusions.
Troubleshooting
Cause: Missing or invalid GOOGLE_APPLICATION_CREDENTIALS.
Fix: See GitHub Archive skill setup for credential creation.
Possible causes:
- Event occurred outside searched date range (check timezone - GH Archive uses UTC)
- Actor username misspelled (case-sensitive)
- Repository name incorrect (must be owner/repo format)
The investigation proceeds with the current evidence and notes the uncertainty in the report. To allow more iterations, raise --max-followups or --max-retries (see Command Flags).
Further Reading
GitHub Archive Skill
Full BigQuery query reference and optimization guide
Evidence Kit
Evidence collection and verification API documentation
Wayback Recovery
Deleted content recovery patterns and CDX API reference
Example Reports
Real-world forensic investigation case studies