The GitHub integration is not actively maintained and may be deprecated in the future. We’re considering removing it as usage has been low and maintenance is complex. If you rely on this integration, please let us know!
The GitHub integration allows you to index as many repositories as you want, making your code searchable and enabling you to chat with Khoj about your codebase.

What Gets Indexed

Khoj indexes the following file types from your GitHub repositories:

Markdown

.md files

Org Mode

.org files

Text & Code

All text-based files (detected automatically)
Khoj automatically detects file types using content analysis. Most code files (Python, JavaScript, Java, etc.) and configuration files will be indexed, while binary files are excluded.
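
As a rough illustration of content-based detection, the sketch below classifies a file as text when a leading sample has no NUL bytes and decodes as UTF-8. This is an assumption for illustration only: the page does not say which detector Khoj uses, and the function name `looks_like_text` and the 8 KB sample size are hypothetical.

```python
import codecs


def looks_like_text(data: bytes, sample_size: int = 8192) -> bool:
    """Heuristically decide whether raw file bytes are text (hypothetical helper)."""
    sample = data[:sample_size]
    if not sample:
        return True  # treat empty files as text
    if b"\x00" in sample:
        return False  # NUL bytes almost never appear in text files
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the sample boundary.
        decoder.decode(sample, final=len(data) <= sample_size)
    except UnicodeDecodeError:
        return False
    return True
```

A heuristic like this misclassifies some legitimate text (UTF-16 files, for example), which is one reason real detectors use richer content analysis.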

Setup Instructions

To connect a GitHub repository to Khoj:
1

Generate Personal Access Token

Create a classic Personal Access Token (PAT) from GitHub Settings. Required scopes:
  • repo - Full control of private repositories
  • admin:org - Full control of orgs and teams (if indexing org repositories)
2

Navigate to Khoj Settings

Go to app.khoj.dev/settings#github (or your self-hosted equivalent)
3

Enter GitHub Configuration

  • Paste your Personal Access Token in the PAT field
  • Add repository details for each repository you want to index:
    • Repository owner (username or organization)
    • Repository name
    • Branch to index (e.g., main, master, develop)
4

Save Configuration

Click “Save” to store your GitHub settings
5

Index Repositories

Go back to the main settings page and click “Configure” to start indexing your repositories
6

Wait for Processing

Khoj will download and process the specified repositories. This may take time for large repositories.
Keep your Personal Access Token secure! Never share it or commit it to version control. The token provides access to your repositories according to the scopes you’ve granted.
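
If you script anything around these settings yourself, one way to keep the token out of version control is to read it from the environment. The sketch below mirrors the fields entered in step 3 (owner, name, branch). The class names, the `load_settings` helper, and the `GITHUB_PAT` variable are all hypothetical and not part of Khoj.

```python
from __future__ import annotations

import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RepoConfig:
    owner: str            # username or organization
    name: str             # repository name
    branch: str = "main"  # branch to index


@dataclass
class GithubSettings:
    pat_token: str = field(repr=False)  # excluded from repr so it stays out of logs
    repos: list[RepoConfig] = field(default_factory=list)


def load_settings(repos: list[RepoConfig], env_var: str = "GITHUB_PAT") -> GithubSettings:
    """Build settings with the token taken from the environment, never from source code."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} in the environment; do not hard-code the token")
    return GithubSettings(pat_token=token, repos=repos)
```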

How Repository Content is Processed

When you index a GitHub repository:
  1. Repository Download: Khoj fetches the repository contents via the GitHub API
  2. File Type Detection: Each file is analyzed to determine if it’s text-based or binary
  3. Content Extraction: Text content is extracted from markdown, org mode, and plain text/code files
  4. Chunking: Files are split into manageable chunks (~256 tokens) while preserving code structure
  5. Indexing: Content is embedded and stored in the search index with repository URLs for reference
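
To make the chunking step concrete, here is a minimal sketch that splits a file into chunks of at most 256 "tokens", breaking on line boundaries so that lines of code stay intact. It counts whitespace-separated words as a stand-in for tokens; Khoj's actual tokenizer and structure-preserving splitter are not described on this page, so treat `chunk_text` as a hypothetical approximation.

```python
def chunk_text(text: str, max_tokens: int = 256) -> list[str]:
    """Split text into chunks of at most max_tokens words, preferring line breaks."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for line in text.splitlines():
        words = line.split()
        if len(words) > max_tokens:
            # A single oversized line is flushed and split by words on its own.
            if current:
                chunks.append("\n".join(current))
                current, count = [], 0
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
            continue
        if current and count + len(words) > max_tokens:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(line)
        count += len(words)
    if current:
        chunks.append("\n".join(current))
    return chunks
```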

File References

Search results and chat responses will include direct links to the files in your GitHub repository, making it easy to jump to the source code. Example: https://github.com/owner/repo/blob/main/src/file.py
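
The link format in the example follows GitHub's standard `blob` URL layout. A small sketch of how such a link can be assembled from the configured owner, repository, and branch plus a file path (the helper name `file_url` is hypothetical):

```python
from urllib.parse import quote


def file_url(owner: str, repo: str, branch: str, path: str) -> str:
    """Build a github.com link to a file on a given branch."""
    safe_branch = quote(branch, safe="/")
    safe_path = quote(path.lstrip("/"), safe="/")  # escape spaces and similar characters
    return f"https://github.com/{owner}/{repo}/blob/{safe_branch}/{safe_path}"


print(file_url("owner", "repo", "main", "src/file.py"))
# https://github.com/owner/repo/blob/main/src/file.py
```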

Performance Considerations

Repository Size: Large repositories take considerably longer to process. The initial indexing of a large codebase can take 10-30 minutes or more.
Rate Limits: GitHub API has rate limits. Using a Personal Access Token increases your rate limit significantly compared to unauthenticated requests.

Best Practices

  • Start with smaller repositories to test the integration
  • Index specific branches that are actively maintained rather than all branches
  • Be selective - only index repositories you frequently need to reference
  • Monitor rate limits - the GitHub API allows 5,000 requests per hour with authentication

Rate Limiting

Khoj handles GitHub API rate limits automatically:
  • If you hit the rate limit during indexing, Khoj will wait until your rate limit resets
  • You’ll see messages in the logs indicating rate limit status
  • Using an authenticated Personal Access Token provides much higher rate limits
If you index many large repositories simultaneously, you may hit GitHub’s rate limits. Khoj will pause and resume automatically, but indexing will take longer.
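
The wait-until-reset behavior can be sketched with the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers that GitHub returns on REST API responses (the reset value is a Unix timestamp in seconds). This is an illustration of the general pattern, not Khoj's implementation; `seconds_until_reset` is a hypothetical helper.

```python
from collections.abc import Mapping


def seconds_until_reset(headers: Mapping[str, str], now: float) -> float:
    """Return how long to sleep before the next request; 0 if quota remains."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    remaining = int(lowered.get("x-ratelimit-remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = int(lowered.get("x-ratelimit-reset", "0"))
    return max(0.0, reset_at - now)
```

A caller would pass `time.time()` as `now` and hand the result to `time.sleep()` before retrying.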

Searching Code

Once your repositories are indexed, you can:
  • Search for code patterns: Find functions, classes, or specific implementations
  • Search documentation: Locate README files and inline documentation
  • Cross-repository search: Search across all your indexed repositories at once
  • Ask coding questions: Chat with Khoj about your codebase architecture, implementations, and patterns

Search Features

Learn more about searching your indexed content

Security and Privacy

Your repository contents are processed and stored according to your Khoj deployment:
  • Khoj Cloud: Content is stored securely with encryption at rest and in transit
  • Self-Hosted: All data remains on your infrastructure

Access Control

  • Your Personal Access Token is stored securely and used only to access GitHub
  • Khoj respects your GitHub permissions - it can only access repositories your token has access to
  • Private repositories remain private; only you can search your indexed content

Updating Indexed Repositories

To refresh the content from your GitHub repositories:
  1. Go to app.khoj.dev/settings
  2. Click “Configure” to trigger a re-sync
  3. Khoj will fetch the latest changes from the specified branches
Khoj performs incremental updates where possible, only re-indexing changed files. This makes subsequent syncs faster than the initial indexing.
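
One common way to implement this kind of incremental sync is to compare the Git blob SHA recorded for each path at the last index against the SHA in the latest tree. The page does not state how Khoj detects changes, so the sketch below is an assumption, and `plan_sync` is a hypothetical name.

```python
def plan_sync(indexed: dict[str, str], latest: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare path -> blob SHA maps and return (paths to re-index, paths to drop)."""
    changed = sorted(path for path, sha in latest.items() if indexed.get(path) != sha)
    removed = sorted(path for path in indexed if path not in latest)
    return changed, removed


previous = {"README.md": "a1", "src/app.py": "b2", "old.txt": "c3"}
current = {"README.md": "a1", "src/app.py": "b9", "src/new.py": "d4"}
print(plan_sync(previous, current))
# (['src/app.py', 'src/new.py'], ['old.txt'])
```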

Troubleshooting

Repository not indexing

1

Verify Token Permissions

Ensure your PAT has the repo scope and hasn’t expired
2

Check Repository Details

Verify the owner, repository name, and branch are spelled correctly
3

Check Repository Access

Confirm your GitHub account has access to the repository (especially for private repos)
4

Review Rate Limits

Check if you’ve hit GitHub API rate limits. Wait for the limit to reset or use a token with higher limits.
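
For the token check in step 1, GitHub reports the scopes granted to a classic PAT in the `X-OAuth-Scopes` response header. The sketch below reads that header from the `/user` endpoint and lists any required scope that is missing. The helper names are hypothetical, and the comparison is literal: it does not know that a broader scope can imply a narrower one.

```python
import urllib.request


def missing_scopes(scopes_header: str, required: tuple[str, ...] = ("repo",)) -> list[str]:
    """Return required scopes absent from a comma-separated X-OAuth-Scopes value."""
    granted = {s.strip() for s in scopes_header.split(",") if s.strip()}
    return [scope for scope in required if scope not in granted]


def fetch_scopes_header(token: str) -> str:
    """Call the GitHub API with the token and return the X-OAuth-Scopes header."""
    request = urllib.request.Request(
        "https://api.github.com/user",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    # An expired or revoked token raises urllib.error.HTTPError (401) here.
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("X-OAuth-Scopes", "")
```

Usage: `missing_scopes(fetch_scopes_header(token))` returns an empty list when the `repo` scope is present.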

“Rate limit reached” errors

  • Wait for your GitHub API rate limit to reset (typically within an hour)
  • Check your current rate limit status with the GitHub REST API (GET https://api.github.com/rate_limit); you can review the token itself at github.com/settings/tokens
  • Ensure you’re using an authenticated Personal Access Token (much higher limits than unauthenticated requests)
  • Consider indexing fewer repositories or doing it in batches
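
GitHub's REST API exposes a `/rate_limit` endpoint whose JSON response includes a `resources.core` object with `limit`, `remaining`, and `reset` (a Unix timestamp). As a sketch, the hypothetical helper below turns that payload into a short status line; fetching the payload is left to whatever HTTP client you prefer.

```python
import math


def core_limit_summary(payload: dict, now: float) -> str:
    """Summarize the core REST quota from a /rate_limit response body."""
    core = payload["resources"]["core"]
    minutes = max(0, math.ceil((core["reset"] - now) / 60))
    return f"{core['remaining']}/{core['limit']} requests left, resets in {minutes} min"


sample = {"resources": {"core": {"limit": 5000, "remaining": 0, "reset": 1700001800}}}
print(core_limit_summary(sample, now=1700000000))
# 0/5000 requests left, resets in 30 min
```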

Missing files in search results

  • Binary files are intentionally excluded from indexing
  • Very large files may be skipped or truncated
  • Ensure the files exist in the branch you specified
  • Check that the file types are text-based (code, markdown, org, or plain text)

Slow indexing performance

  • Large repositories (10,000+ files) can take significant time to process
  • GitHub API responses can be slow for repositories with deep directory structures
  • Consider indexing only the specific branches you need
  • The recursive tree API call fetches all files at once, which can be slow for large repos
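
For context on the last point: GitHub's Git Trees API can return every entry in a branch in one response when called with `recursive=1`, and it sets a `truncated` flag when the tree is too large to return in full. The sketch below assumes that endpoint and uses hypothetical helper names; it does not show Khoj's own code.

```python
def tree_url(owner: str, repo: str, branch: str) -> str:
    """URL for a recursive tree listing; assumes a simple branch name such as 'main'."""
    return f"https://api.github.com/repos/{owner}/{repo}/git/trees/{branch}?recursive=1"


def list_files(payload: dict) -> tuple[list[str], bool]:
    """Return (file paths, truncated flag) from a Git Trees API response body."""
    paths = [entry["path"] for entry in payload.get("tree", []) if entry.get("type") == "blob"]
    return paths, bool(payload.get("truncated", False))


sample = {
    "tree": [
        {"path": "src", "type": "tree"},
        {"path": "src/app.py", "type": "blob"},
        {"path": "README.md", "type": "blob"},
    ],
    "truncated": False,
}
print(list_files(sample))
# (['src/app.py', 'README.md'], False)
```

When `truncated` is true, the listing is incomplete and files can be missing from it, which is worth ruling out for very large repositories.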

Removing GitHub Integration

To disconnect GitHub:
  1. Go to app.khoj.dev/settings#github
  2. Remove your Personal Access Token
  3. Remove all repository configurations
  4. Click “Save”
  5. Your indexed repository content will be removed from Khoj
Optionally, you can also revoke the Personal Access Token in GitHub Settings if you’re no longer using it.
Removing the GitHub integration will delete all indexed repository content from Khoj. This action cannot be undone.

Alternative: Upload Repository Archives

If you don’t want to use the GitHub integration or have issues with it, you can:
  1. Clone your repository locally
  2. Either use the Khoj desktop app to index the local folder, or zip the repository and upload it as files through the web interface
This approach gives you more control but doesn’t provide automatic updates when your repository changes.
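
If you take the zip route, a minimal sketch like the following archives a cloned repository while leaving out the `.git` directory, which holds only version-control metadata. The function name `zip_repo` and the default skip list are hypothetical; write the archive outside the repository folder so it does not include itself.

```python
import zipfile
from pathlib import Path


def zip_repo(repo_dir: str, archive_path: str, skip_dirs: tuple[str, ...] = (".git",)) -> Path:
    """Zip every file under repo_dir except those inside skipped directories."""
    repo = Path(repo_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(repo.rglob("*")):
            relative = path.relative_to(repo)
            if path.is_file() and not any(part in skip_dirs for part in relative.parts):
                archive.write(path, relative.as_posix())
    return Path(archive_path)
```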
