## Module Location

`src/artifactminer/RepositoryIntelligence/`

## Core Components

### 1. Repository-Level Intelligence

File: `repo_intelligence_main.py`

#### getRepoStats(repo_path)

Extracts comprehensive statistics for a Git repository.

Location: `src/artifactminer/RepositoryIntelligence/repo_intelligence_main.py:134-189`
Process:

1. **Git Repository Validation**
   - Checks for a `.git` directory using `isGitRepo()` (lines 29-31)
   - Raises `ValueError` if the path is not a valid Git repository
2. **Project Name Extraction**
   - Uses the directory name as the project name (line 141)
   - Example: `/path/to/my-project` → `"my-project"`
3. **Language Detection**
   - Traverses the HEAD commit tree to analyze all files (lines 145-158)
   - Counts file extensions (`.py`, `.js`, `.java`, etc.)
   - Calculates language percentages based on file counts
   - Identifies the primary language as the most common extension
   - Example: a repository that is mostly `.py` files yields `primary_language = ".py"`
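The extension counting and percentage math above can be sketched as a pure function. Note this is an illustration, not the actual implementation: `language_stats` and its list-of-paths input are stand-ins for the real GitPython tree traversal.

```python
from collections import Counter
from pathlib import PurePosixPath

def language_stats(file_paths):
    """Count file extensions and derive language percentages.

    `file_paths` stands in for the blob paths gathered while walking
    the HEAD commit tree; the real code traverses the Git tree object.
    """
    counts = Counter(
        PurePosixPath(p).suffix for p in file_paths if PurePosixPath(p).suffix
    )
    total = sum(counts.values())
    languages = [ext for ext, _ in counts.most_common()]
    percentages = [round(n / total * 100, 1) for _, n in counts.most_common()]
    primary = languages[0] if languages else None
    return languages, percentages, primary

# Three .py files and one .md file → primary language ".py" at 75%
langs, pcts, primary = language_stats(
    ["app.py", "lib/util.py", "tests/test_app.py", "README.md"]
)
```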
4. **Framework Detection**
   - Calls `detect_frameworks()` from `framework_detector.py` (line 161)
   - See the Framework Detection section below
5. **Collaboration Detection**
   - Checks for remote repositories (line 164)
   - Counts unique author emails across all commits (lines 166-167)
   - A repository is collaborative if it has remote repositories, OR has more than one unique author email
6. **Commit Timeline**
   - Iterates through all commits (line 170)
   - Extracts the first (oldest) and last (newest) commit dates (lines 171-172)
   - Counts total commits
7. **Health Score Calculation**
   - Calls `calculateRepoHealth()` (line 175)
   - See the Health Scoring section below

Returns a `RepoStats` dataclass with:
- `project_name`: Repository name
- `project_path`: Filesystem path
- `is_collaborative`: Boolean
- `Languages`: List of file extensions
- `language_percentages`: List of percentages
- `primary_language`: Most common extension
- `first_commit`: Earliest commit datetime
- `last_commit`: Most recent commit datetime
- `total_commits`: Total commit count
- `frameworks`: List of detected frameworks
- `health_score`: 0-100 health score

The stats are persisted to the `repo_stats` table via `saveRepoStats()` (lines 191-231).
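A minimal sketch of the `RepoStats` shape, with field names taken from the list above; the concrete type annotations are assumptions, not copied from the source:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class RepoStats:
    """Shape of the value returned by getRepoStats(). Field names come
    from the docs; the types here are assumed, not authoritative."""
    project_name: str
    project_path: str
    is_collaborative: bool
    Languages: List[str]
    language_percentages: List[float]
    primary_language: Optional[str]
    first_commit: Optional[datetime]
    last_commit: Optional[datetime]
    total_commits: int
    frameworks: List[str]
    health_score: float  # 0.0-100.0

# Hypothetical example instance
stats = RepoStats(
    project_name="my-project",
    project_path="/path/to/my-project",
    is_collaborative=True,
    Languages=[".py", ".md"],
    language_percentages=[75.0, 25.0],
    primary_language=".py",
    first_commit=datetime(2023, 1, 5),
    last_commit=datetime(2024, 6, 1),
    total_commits=120,
    frameworks=["FastAPI"],
    health_score=85.0,
)
```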
## Health Scoring

Function: `calculateRepoHealth(repo_path, last_commit, total_commits)`

Location: `src/artifactminer/RepositoryIntelligence/repo_intelligence_main.py:47-132`

Algorithm: scores repositories from 0.0 to 100.0 based on five categories:
### 1. Documentation Presence (30 points max)

| File/Directory | Points |
|---|---|
| README.md | 12 |
| README | 8 |
| LICENSE | 8 |
| CONTRIBUTING.md | 5 |
| CHANGELOG.md | 3 |
| docs/ directory | 2 |
### 2. Commit Recency (25 points max)

Logic (lines 77-90):

| Days Since Last Commit | Points |
|---|---|
| ≤ 7 days | 25 |
| ≤ 30 days | 20 |
| ≤ 90 days | 15 |
| ≤ 180 days | 10 |
| ≤ 365 days | 5 |
| > 365 days | 0 |
### 3. Commit Activity (20 points max)

Logic (lines 92-101):

| Total Commits | Points |
|---|---|
| ≥ 100 | 20 |
| ≥ 50 | 15 |
| ≥ 20 | 10 |
| ≥ 10 | 5 |
| < 10 | 0 |
### 4. Test Presence (15 points max)

Logic (lines 103-120): checks for:
- Test directories: `tests/`, `test/`, `__tests__/`, `spec/`, `specs/`
- Test configs: `pytest.ini`, `jest.config.js`, `karma.conf.js`
- Test files: `test_*.py`, `*_test.py`, `*.test.js`, `*.spec.js`
### 5. Configuration Files (10 points max)

Logic (lines 122-130): checks for `.gitignore`, `.editorconfig`, `.prettierrc`, `.eslintrc`, and `pyproject.toml`.
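The recency and activity tables above reduce to simple threshold lookups. This is a paraphrase of the tables, not the code from `calculateRepoHealth()`:

```python
from datetime import datetime, timezone

def recency_points(last_commit, now=None):
    """Commit-recency portion of the health score (25 points max),
    mirroring the table above."""
    now = now or datetime.now(timezone.utc)
    days = (now - last_commit).days
    for limit, points in [(7, 25), (30, 20), (90, 15), (180, 10), (365, 5)]:
        if days <= limit:
            return points
    return 0

def activity_points(total_commits):
    """Commit-activity portion of the health score (20 points max)."""
    for threshold, points in [(100, 20), (50, 15), (20, 10), (10, 5)]:
        if total_commits >= threshold:
            return points
    return 0
```

For example, a repository last touched 2 days ago with 120 commits earns 25 + 20 points from these two categories.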
## Framework Detection

File: `framework_detector.py`

Function: `detect_frameworks(repo_path)`

Location: `src/artifactminer/RepositoryIntelligence/framework_detector.py:134-161`

Process:

1. Calls ecosystem-specific detectors:
   - `detect_python_frameworks()` (lines 27-51)
   - `detect_javascript_frameworks()` (lines 54-80)
   - `detect_java_frameworks()` (lines 83-106)
   - `detect_go_frameworks()` (lines 109-131)
2. Deduplicates results while preserving order
3. Returns a list of framework names
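Order-preserving deduplication is a one-pass scan; a minimal sketch (the helper name is illustrative):

```python
def dedupe_preserving_order(frameworks):
    """Deduplicate detected framework names while keeping first-seen
    order, as the aggregator does after running all ecosystem detectors."""
    seen = set()
    result = []
    for name in frameworks:
        if name not in seen:
            seen.add(name)
            result.append(name)
    return result

# e.g. two detectors both report "React" in a mixed-language repo
dedupe_preserving_order(["FastAPI", "React", "React", "Gin"])
# → ["FastAPI", "React", "Gin"]
```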
### Python Framework Detection

Location: lines 27-51

Dependency files scanned: `requirements.txt`, `pyproject.toml`, `Pipfile`, `setup.py`

Example: if `requirements.txt` contains `fastapi==0.104.1`, the "FastAPI" framework is detected.
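A sketch of how `requirements.txt` scanning could work. Both `KNOWN_FRAMEWORKS` and `detect_from_requirements` are hypothetical names, and the package map shown is an assumption; the real detector's list lives in `framework_detector.py` and may differ.

```python
import re

# Hypothetical package-to-framework map (assumed, not from the source)
KNOWN_FRAMEWORKS = {
    "fastapi": "FastAPI",
    "django": "Django",
    "flask": "Flask",
}

def detect_from_requirements(requirements_text):
    """Match package names in requirements.txt content against a
    known-framework map. Version pins (==, >=, etc.) are ignored."""
    found = []
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Package name is everything before the first specifier/extras marker
        name = re.split(r"[<>=!~\[;]", line, maxsplit=1)[0].strip().lower()
        if name in KNOWN_FRAMEWORKS and KNOWN_FRAMEWORKS[name] not in found:
            found.append(KNOWN_FRAMEWORKS[name])
    return found

detect_from_requirements("fastapi==0.104.1\npydantic>=2\n# comment\n")
# → ["FastAPI"]
```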
### JavaScript/TypeScript Framework Detection

Location: lines 54-80

Dependency file: `package.json`

Detection method: if `package.json` lists `"react": "^18.2.0"` among its dependencies, the "React" framework is detected.
### Java Framework Detection

Location: lines 83-106

Dependency files: `pom.xml` (Maven), `build.gradle` (Gradle), `build.gradle.kts` (Kotlin Gradle)

Example: `"spring"` appearing in `pom.xml` → "Spring" framework
### Go Framework Detection

Location: lines 109-131

Dependency file: `go.mod`

Example: `github.com/gin-gonic/gin` in `go.mod` → "Gin" framework
### 2. User-Level Intelligence

File: `repo_intelligence_user.py`

#### getUserRepoStats(repo_path, user_email)

Extracts user-specific contribution statistics.

Location: `src/artifactminer/RepositoryIntelligence/repo_intelligence_user.py:32-73`
Process:

1. **Repository Validation**
   - Validates the repo path using `isGitRepo()`
   - Validates the email format using `email_validator` (lines 35-37)
2. **User Commit Filtering**
   - Filters commits by `author.email` matching the user's email (line 45)
3. **Contribution Percentage**
   - Calculates `(user_commits / total_commits) * 100` (line 52)
   - Example: 30 user commits out of 100 total = 30.0%
4. **Commit Frequency**
   - Time window: the user's first commit to their last commit (line 54)
   - Converts the window to weeks: `delta.total_seconds() / 604800` (line 55)
   - Calculates `total_commits / weeks` (line 59)
   - Example: 30 commits over 10 weeks = 3.0 commits/week
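The frequency math above can be sketched directly; the zero-width-window fallback is an assumption for when all commits land at the same timestamp:

```python
from datetime import datetime

def commit_frequency(first_commit, last_commit, total_commits):
    """Commits per week over the user's active window, mirroring the
    delta.total_seconds() / 604800 conversion described above."""
    delta = last_commit - first_commit
    weeks = delta.total_seconds() / 604800  # 604800 seconds per week
    if weeks <= 0:
        return float(total_commits)  # assumed fallback for a single burst
    return total_commits / weeks

# 30 commits over exactly 70 days (10 weeks) → 3.0 commits/week
commit_frequency(datetime(2024, 1, 1), datetime(2024, 3, 11), 30)
```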
5. **Activity Classification**
   - Calls `classify_commit_activities()` (line 62)
   - See the Activity Classification section below

Returns a `UserRepoStats` dataclass with:
- `project_name`: Repository name
- `project_path`: Filesystem path
- `first_commit`: User's first commit datetime
- `last_commit`: User's last commit datetime
- `total_commits`: User's commit count
- `userStatspercentages`: User's contribution %
- `commitFrequency`: Commits per week
- `commitActivities`: Activity breakdown dict
- `user_role`: User's role (optional, can be set later)

The stats are persisted to the `user_repo_stats` table via `saveUserRepoStats()` (lines 160-208).
## Activity Classification

File: `activity_classifier.py`

Function: `classify_commit_activities(additions)`

Location: `src/artifactminer/RepositoryIntelligence/activity_classifier.py:6-242`

Purpose: classifies commit content into five activity types:
- `code`: Actual code additions
- `test`: Test-related changes
- `docs`: Documentation (comments, docstrings, markdown)
- `config`: Configuration files (YAML, TOML, env vars)
- `design`: Design artifacts (Figma links, wireframes)
### 1. Per-Commit Analysis (lines 19-213)

For each commit's additions:
- Skips blank lines
- Detects comment markers (`#`, `//`, `/*`)
- Detects config-style lines:
  - Environment variables: `FOO=bar` with uppercase keys
  - YAML-style: `key: value`
  - Config keywords: `toml`, `yaml`, `json`, `settings`
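A rough single-line classifier illustrating the per-commit heuristics above. This is a simplified sketch: the real classifier covers many more cases (test files, docstrings, design links), and `classify_line` is an illustrative name, not a function from the module.

```python
import re

def classify_line(line):
    """Classify one added line using the blank/comment/config heuristics
    described above. Simplified sketch; not the actual classifier."""
    stripped = line.strip()
    if not stripped:
        return None  # blank lines are skipped
    if stripped.startswith(("#", "//", "/*")):
        return "docs"  # comment lines count toward documentation
    # Environment-variable style: UPPERCASE_KEY=value
    if re.match(r"^[A-Z][A-Z0-9_]*=", stripped):
        return "config"
    # YAML-style key: value
    if re.match(r"^[\w.-]+:\s", stripped):
        return "config"
    return "code"

classify_line("DATABASE_URL=postgres://db")  # → "config"
classify_line("# fix rounding bug")          # → "docs"
classify_line("return total / weeks")        # → "code"
```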
### 2. Accumulation (lines 195-213)

### 3. Percentage Calculation (lines 216-240)

Goal: distribute 100% across categories based on line counts.

## User Additions Collection
Function: `collect_user_additions(repo_path, user_email, ...)`

Location: `src/artifactminer/RepositoryIntelligence/repo_intelligence_user.py:93-147`

Purpose: collect added lines from the user's commits for AI summary generation.

Process:

1. **Validate Inputs**
   - Checks that the repo path is a valid Git repository
   - Validates and normalizes the email address (lines 107-114)
2. **Filter Commits**
   - Gets commits in the date range (`since` to `until`)
   - Filters by author email
   - Skips merge commits if `skip_merges=True` (lines 118-123)
   - Reverses to chronological order (oldest → newest) (line 126)
3. **Extract Additions per Commit** (lines 129-146)
4. **Extract Added Lines** (lines 76-90)

The collected additions feed `generate_summaries_for_ranked()` (lines 210-330).
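One way the added-line extraction step could look, operating on a unified-diff patch string. The real code works with commit diffs (GitPython), so treat this as an illustrative sketch with an assumed helper name:

```python
def added_lines(patch_text):
    """Pull the added lines out of a unified-diff patch. '+++' file
    headers also start with '+' and must be excluded."""
    lines = []
    for line in patch_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            lines.append(line[1:])  # strip the leading '+'
    return lines

patch = """\
--- a/app.py
+++ b/app.py
@@ -1,2 +1,3 @@
 import os
+import sys
 print(os.name)
"""
added_lines(patch)  # → ["import sys"]
```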
## Collaboration Detection

Logic: a repository is collaborative if it:
1. Has remote repositories, OR
2. Has more than one unique author email

Location: `src/artifactminer/RepositoryIntelligence/repo_intelligence_main.py:164-167`

Impact:
- Collaborative repos: user-scoped skill extraction
- Solo repos: full-repository skill extraction
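The OR rule above fits in a one-line predicate; `is_collaborative` here takes plain lists as stand-ins for the repo's remotes and commit author emails:

```python
def is_collaborative(remote_names, author_emails):
    """A repo is collaborative when it has any remote OR more than one
    unique author email, per the rule described above."""
    return len(remote_names) > 0 or len(set(author_emails)) > 1

is_collaborative(["origin"], ["a@example.com"])           # → True (has a remote)
is_collaborative([], ["a@example.com", "b@example.com"])  # → True (two authors)
is_collaborative([], ["a@example.com", "a@example.com"])  # → False (solo, no remote)
```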
## API Integration

Repository intelligence is invoked by the analyze router.

Router: `src/artifactminer/api/analyze.py`

Endpoints:
- `POST /analyze/{zip_id}`: Analyze all repos in a ZIP
- `POST /repos/analyze`: Analyze a specific repository path

Flow:
1. Discover Git repositories in the extraction path
2. For each repository:
   - Call `getRepoStats(repo_path)`
   - Call `saveRepoStats(stats, db)`
   - Get the user email from question answers
   - Call `getUserRepoStats(repo_path, user_email)`
   - Call `saveUserRepoStats(user_stats, db)`
3. Trigger skill extraction and evidence generation
## Performance Considerations

### Large Repositories
- Commit iteration can be slow for repos with 10,000+ commits
- Mitigation: use the `max_count` parameter of `iter_commits()`

### File Tree Traversal
- Language detection traverses the entire HEAD commit tree
- Optimization: only counts extensions; does not read file contents

### User Additions Collection
- Patch generation is memory-intensive for large commits
- Safety: `max_patch_bytes` caps individual patches (default: 200 KB)
- Limiting: `max_commits` parameter (default: 500)