What Gets Indexed
Khoj indexes the following file types from your GitHub repositories:
- Markdown: .md files
- Org Mode: .org files
- Text & Code: All text-based files (detected automatically)
Khoj automatically detects file types using content analysis. Most code files (Python, JavaScript, Java, etc.) and configuration files will be indexed, while binary files are excluded.
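As a rough illustration of content-based detection, the sketch below treats a file as text when a leading sample contains no NUL bytes and decodes as UTF-8. This is an assumed heuristic for illustration only; the function name `looks_like_text` and the sample size are hypothetical, and Khoj's actual detection may work differently.

```python
import codecs


def looks_like_text(data: bytes, sample_size: int = 8192) -> bool:
    """Heuristic: text if a leading sample has no NUL bytes and is valid UTF-8."""
    sample = data[:sample_size]
    if b"\x00" in sample:
        # NUL bytes are rare in text and common in binary formats
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the sample boundary
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

A heuristic like this accepts source code and configuration files while rejecting images and compiled artifacts, but it would also reject text stored in other encodings such as UTF-16.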
Setup Instructions
To connect a GitHub repository to Khoj:
Generate Personal Access Token
Create a classic Personal Access Token (PAT) from GitHub Settings
Required scopes:
- repo: Full control of private repositories
- admin:org: Read organization data (if indexing org repositories)
Navigate to Khoj Settings
Go to app.khoj.dev/settings#github (or your self-hosted equivalent)
Enter GitHub Configuration
- Paste your Personal Access Token in the PAT field
- Add repository details for each repository you want to index:
  - Repository owner (username or organization)
  - Repository name
  - Branch to index (e.g., main, master, develop)
Index Repositories
Go back to the main settings page and click “Configure” to start indexing your repositories
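Before pasting a token into Khoj, it can help to confirm it carries the scopes listed above. For classic tokens, GitHub reports the granted scopes in the `X-OAuth-Scopes` response header of any authenticated API call. The sketch below only parses that header value; `REQUIRED_SCOPES` and `missing_scopes` are hypothetical names, not part of Khoj.

```python
REQUIRED_SCOPES = {"repo"}  # add "admin:org" when indexing organization repositories


def missing_scopes(scopes_header: str, required: set[str] = REQUIRED_SCOPES) -> set[str]:
    """Return the required scopes absent from an X-OAuth-Scopes header value."""
    granted = {scope.strip() for scope in scopes_header.split(",") if scope.strip()}
    return set(required) - granted
```

For example, `missing_scopes("gist, repo")` returns an empty set, while `missing_scopes("gist")` returns `{"repo"}`.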
How Repository Content is Processed
When you index a GitHub repository:
- Repository Download: Khoj fetches the repository contents via the GitHub API
- File Type Detection: Each file is analyzed to determine if it’s text-based or binary
- Content Extraction: Text content is extracted from markdown, org mode, and plain text/code files
- Chunking: Files are split into manageable chunks (~256 tokens) while preserving code structure
- Indexing: Content is embedded and stored in the search index with repository URLs for reference
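The chunking step can be pictured with a minimal sketch that groups whole lines into chunks of at most 256 tokens. It assumes whitespace-separated words as a stand-in for real tokens and keeps lines intact as a crude way of preserving code structure. The function `chunk_lines` is hypothetical; Khoj's own splitter and tokenizer are likely more sophisticated.

```python
def chunk_lines(text: str, max_tokens: int = 256) -> list[str]:
    """Group whole lines into chunks of at most max_tokens whitespace-separated tokens."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for line in text.splitlines(keepends=True):
        n = len(line.split())
        # Start a new chunk when adding this line would exceed the budget
        if current and count + n > max_tokens:
            chunks.append("".join(current))
            current, count = [], 0
        current.append(line)
        count += n
    if current:
        chunks.append("".join(current))
    return chunks
```

Note that a single line longer than the budget is kept as one oversized chunk rather than split mid-line.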
File References
Search results and chat responses will include direct links to the files in your GitHub repository, making it easy to jump to the source code. Example:
https://github.com/owner/repo/blob/main/src/file.py
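A link of this shape can be assembled from the repository owner, name, branch, and file path. The helper below is a hypothetical sketch (the name `blob_url` is not from Khoj) that percent-encodes the branch and path while keeping slashes.

```python
from urllib.parse import quote


def blob_url(owner: str, repo: str, branch: str, path: str) -> str:
    """Build a github.com link to a file on a given branch."""
    return (
        f"https://github.com/{owner}/{repo}/blob/"
        f"{quote(branch, safe='/')}/{quote(path, safe='/')}"
    )
```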
Performance Considerations
- Repository Size: Large repositories take considerably longer to process. The initial indexing of a large codebase can take 10-30 minutes or more.
- Rate Limits: GitHub API has rate limits. Using a Personal Access Token increases your rate limit significantly compared to unauthenticated requests.
Best Practices
- Start with smaller repositories to test the integration
- Index specific branches that are actively maintained rather than all branches
- Be selective - only index repositories you frequently need to reference
- Monitor rate limits - the GitHub API allows 5,000 requests per hour with authentication
Rate Limiting
Khoj handles GitHub API rate limits automatically:
- If you hit the rate limit during indexing, Khoj will wait until your rate limit resets
- You’ll see messages in the logs indicating rate limit status
- Using an authenticated Personal Access Token provides much higher rate limits (5,000 requests per hour, versus 60 per hour for unauthenticated requests)
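The wait-until-reset behaviour can be sketched from the rate limit headers GitHub attaches to API responses: `x-ratelimit-remaining` and `x-ratelimit-reset` (the latter in UTC epoch seconds). The function below is an illustrative assumption about how a client might compute the delay, not Khoj's actual code.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to wait before the next request; 0 if quota remains."""
    # HTTP header names are case-insensitive, so normalize before lookup
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(lowered.get("x-ratelimit-reset", "0"))
    current = time.time() if now is None else now
    return max(0.0, reset_at - current)
```

A caller would pass the result to `time.sleep` before retrying the request.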
Searching Code
Once your repositories are indexed, you can:
- Search for code patterns: Find functions, classes, or specific implementations
- Search documentation: Locate README files and inline documentation
- Cross-repository search: Search across all your indexed repositories at once
- Ask coding questions: Chat with Khoj about your codebase architecture, implementations, and patterns
Security and Privacy
Your repository contents are processed and stored according to your Khoj deployment:
- Khoj Cloud: Content is stored securely with encryption at rest and in transit
- Self-Hosted: All data remains on your infrastructure
Access Control
- Your Personal Access Token is stored securely and used only to access GitHub
- Khoj respects your GitHub permissions - it can only access repositories your token has access to
- Private repositories remain private; only you can search your indexed content
Updating Indexed Repositories
To refresh the content from your GitHub repositories:
- Go to app.khoj.dev/settings
- Click “Configure” to trigger a re-sync
- Khoj will fetch the latest changes from the specified branches
Khoj performs incremental updates where possible, only re-indexing changed files. This makes subsequent syncs faster than the initial indexing.
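One way to picture an incremental update is to compare the blob SHA recorded for each path at the last sync with the SHA in the current tree, and re-index only the paths that differ. The sketch below assumes both sides are available as path-to-SHA mappings. `plan_sync` is a hypothetical name, and Khoj may detect changes by other means.

```python
def plan_sync(indexed: dict[str, str], current: dict[str, str]) -> tuple[set[str], set[str]]:
    """Compare path -> blob SHA maps; return (paths to re-index, paths to delete)."""
    # New files and files whose SHA changed need (re-)indexing
    to_index = {path for path, sha in current.items() if indexed.get(path) != sha}
    # Files that no longer exist on the branch should leave the index
    to_delete = set(indexed) - set(current)
    return to_index, to_delete
```

Unchanged files appear in neither set, which is why a later sync touches far fewer files than the initial one.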
Troubleshooting
Repository not indexing
Check Repository Access
Confirm your GitHub account has access to the repository (especially for private repos)
“Rate limit reached” errors
- Wait for your GitHub API rate limit to reset (typically within an hour)
- Check your current rate limit status with the GitHub REST API's /rate_limit endpoint, and review your tokens at github.com/settings/tokens
- Ensure you’re using an authenticated Personal Access Token (much higher limits than unauthenticated requests)
- Consider indexing fewer repositories or doing it in batches
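Rate limit status for a token can be read from GitHub's REST API at `GET https://api.github.com/rate_limit`, which reports the limit, remaining requests, and reset time for the core API. The sketch below uses only the standard library; the function names are hypothetical and the token is supplied by the caller.

```python
import json
import urllib.request
from datetime import datetime, timezone


def fetch_rate_limit(token: str) -> dict:
    """Call GitHub's rate limit endpoint with the given Personal Access Token."""
    request = urllib.request.Request(
        "https://api.github.com/rate_limit",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def summarize_core(payload: dict) -> str:
    """Format the core API quota from a /rate_limit response body."""
    core = payload["resources"]["core"]
    reset = datetime.fromtimestamp(core["reset"], tz=timezone.utc)
    return f"{core['remaining']}/{core['limit']} requests left, resets at {reset:%H:%M:%S} UTC"
```

Calling `summarize_core(fetch_rate_limit(token))` prints a one-line summary of how much quota is left and when it resets.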
Missing files in search results
- Binary files are intentionally excluded from indexing
- Very large files may be skipped or truncated
- Ensure the files exist in the branch you specified
- Check that the file types are text-based (code, markdown, org, or plain text)
Slow indexing performance
- Large repositories (10,000+ files) can take significant time to process
- GitHub API responses can be slow for repositories with deep directory structures
- Consider indexing only the specific branches you need
- The recursive tree API call fetches all files at once, which can be slow for large repos
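For reference, GitHub's recursive tree listing is `GET /repos/{owner}/{repo}/git/trees/{tree_sha}?recursive=1`, where `tree_sha` may be a branch name. The response carries a `truncated` flag when the tree is too large to return in one call. The sketch below builds the URL and extracts file paths from a response body. The helper names are hypothetical, simple branch names without special characters are assumed, and this is not necessarily how Khoj issues the request.

```python
def tree_url(owner: str, repo: str, branch: str) -> str:
    """URL for the recursive tree listing of a branch."""
    return f"https://api.github.com/repos/{owner}/{repo}/git/trees/{branch}?recursive=1"


def blob_paths(tree_response: dict) -> list[str]:
    """Return file paths from a tree response, refusing truncated listings."""
    if tree_response.get("truncated"):
        # GitHub sets this flag when the tree exceeds its response limits
        raise ValueError("tree listing truncated; fetch subtrees individually")
    return [
        entry["path"]
        for entry in tree_response.get("tree", [])
        if entry.get("type") == "blob"  # "tree" entries are directories
    ]
```

Because the whole listing arrives in a single response, very large repositories pay the full cost up front, which matches the slowness described above.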
Removing GitHub Integration
To disconnect GitHub:
- Go to app.khoj.dev/settings#github
- Remove your Personal Access Token
- Remove all repository configurations
- Click “Save”
- Your indexed repository content will be removed from Khoj
Alternative: Upload Repository Archives
If you don’t want to use the GitHub integration or have issues with it, you can:
- Clone your repository locally
- Use the Khoj desktop app to index the local folder
- Or zip the repository and upload it as files through the web interface
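For the zip route, a local clone can be archived with a short script. The sketch below skips the `.git` directory so only working-tree files are included. `zip_repository` is a hypothetical helper, and the archive should be written outside the repository folder so it does not end up inside itself.

```python
import zipfile
from pathlib import Path


def zip_repository(repo_dir: str, archive_path: str) -> int:
    """Zip all files under repo_dir except .git contents; return the file count."""
    root = Path(repo_dir)
    count = 0
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            relative = path.relative_to(root)
            # Skip directories and anything under .git
            if not path.is_file() or ".git" in relative.parts:
                continue
            archive.write(path, relative.as_posix())
            count += 1
    return count
```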
