Overview
Doc-MCP is a comprehensive documentation RAG system that transforms GitHub repositories into queryable knowledge bases accessible via the Model Context Protocol (MCP). It provides semantic search, AI-powered Q&A, and batch processing capabilities for documentation ingestion.
Features
- Semantic Search - Find answers across documentation using natural language
- AI-Powered Q&A - Get intelligent responses with source citations
- Batch Processing - Ingest entire repositories with progress tracking
- Incremental Updates - Detect and sync only changed files
- MCP Server - Expose documentation through Model Context Protocol
- Repository Management - Complete CRUD operations for ingested docs
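The incremental-update feature can be pictured as a diff between the file fingerprints recorded at the last ingestion and the ones seen now. The sketch below is illustrative only: `RepoDiff`, `diff_repo_state`, and the path-to-hash mapping are hypothetical names, and the source does not say how Doc-MCP actually detects changed files.

```python
from dataclasses import dataclass, field


@dataclass
class RepoDiff:
    """Files that need work on the next sync (hypothetical structure)."""
    added: list[str] = field(default_factory=list)
    modified: list[str] = field(default_factory=list)
    removed: list[str] = field(default_factory=list)


def diff_repo_state(previous: dict[str, str], current: dict[str, str]) -> RepoDiff:
    """Compare two {file_path: content_hash} snapshots.

    Assumption: each hash is a stable content digest (for example a Git
    blob SHA), so an unchanged hash means the file can be skipped.
    """
    diff = RepoDiff()
    for path, digest in current.items():
        if path not in previous:
            diff.added.append(path)
        elif previous[path] != digest:
            diff.modified.append(path)
    # Anything seen last time but missing now was deleted upstream.
    diff.removed = [path for path in previous if path not in current]
    return diff
```

Under this scheme only the `added` and `modified` paths would need new embeddings, and `removed` paths would be dropped from the index.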
Architecture
Prerequisites
- Python 3.13+
- MongoDB Atlas with Vector Search enabled
- Nebius API key for embeddings and LLM
- GitHub token (optional, for private repos)
Installation
Usage
Starting the Application
Once started, the application is available at http://localhost:7860.
MCP Server Access
The MCP server is automatically exposed at:
MCP Integration
Gradio MCP Server
Doc-MCP uses Gradio's built-in MCP server functionality:
MCP Tools Available
list_available_repos_docs
List all ingested repositories.
Returns: List of repositories with names and branches
list_repository_files
List files in a repository with optional filtering.
Parameters:
- repo_name (str): Repository name
- file_extensions (List[str]): Extensions to filter (default: [".md", ".mdx"])
- branch (str): Branch name (default: "main")
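To make the `file_extensions` filter concrete, here is a minimal sketch of suffix-based filtering. `filter_by_extension` is a hypothetical helper, not the tool's actual implementation, and matching case-insensitively is an assumption; only the default extensions come from the parameter list above.

```python
from pathlib import PurePosixPath

# Defaults stated for the file_extensions parameter.
DEFAULT_EXTENSIONS = (".md", ".mdx")


def filter_by_extension(paths, file_extensions=DEFAULT_EXTENSIONS):
    """Keep only paths whose suffix is in file_extensions.

    Assumption: matching ignores case, so "NOTES.MD" counts as ".md".
    """
    wanted = {ext.lower() for ext in file_extensions}
    return [p for p in paths if PurePosixPath(p).suffix.lower() in wanted]
```

For example, passing `file_extensions=[".rst"]` would keep only reStructuredText files from the same listing.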
get_single_file_content_from_repo
Retrieve content of a single file.
Parameters:
- repo_name (str): Repository name
- file_path (str): Path to file
- branch (str): Branch name (default: "main")
get_multi_file_content_from_repo
Retrieve content of multiple files.
Parameters:
- repo_name (str): Repository name
- file_paths (List[str]): List of file paths
- branch (str): Branch name (default: "main")
query_doc
Execute a semantic search query against a repository.
Parameters:
- repo_name (str): Repository name
- query (str): Natural language query
- mode (str): Search mode ("default", "text_search", "hybrid")
- top_k (int): Number of results (default: 5, max: 100)
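The constraints on `query_doc` arguments can be checked before a request is sent. The sketch below is a hypothetical client-side validator, not part of Doc-MCP: the three modes and the `top_k` default and maximum come from the parameter list, while the `"default"` fallback for `mode` and the lower bound of 1 for `top_k` are assumptions.

```python
VALID_MODES = ("default", "text_search", "hybrid")
MAX_TOP_K = 100  # documented maximum


def validate_query_args(repo_name: str, query: str,
                        mode: str = "default", top_k: int = 5) -> dict:
    """Return the arguments as a dict, or raise ValueError if one is invalid."""
    if not repo_name.strip():
        raise ValueError("repo_name must not be empty")
    if not query.strip():
        raise ValueError("query must not be empty")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    # Assumption: at least one result must be requested.
    if not 1 <= top_k <= MAX_TOP_K:
        raise ValueError(f"top_k must be between 1 and {MAX_TOP_K}, got {top_k}")
    return {"repo_name": repo_name, "query": query, "mode": mode, "top_k": top_k}
```

Failing fast on the client keeps malformed requests from reaching the MCP server at all.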
Web Interface Usage
Documentation Ingestion
- Navigate to the "📥 Documentation Ingestion" tab
- Enter the GitHub repository URL (e.g., owner/repo)
- Select markdown files to process
- Execute the two-step pipeline:
  - Load files from GitHub
  - Generate embeddings
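The repository field is described as accepting a GitHub reference such as owner/repo. As a sketch of how that input could be normalized before loading files, the hypothetical `parse_repo` helper below accepts either the short form or a full https://github.com/ URL; the exact forms the web interface accepts are an assumption.

```python
import re

# Optional https://github.com/ prefix, then owner/repo, with an optional
# ".git" suffix or trailing slash. Illustrative pattern, not Doc-MCP's own.
_REPO_PATTERN = re.compile(
    r"^(?:https?://github\.com/)?"
    r"(?P<owner>[A-Za-z0-9_.-]+)/(?P<repo>[A-Za-z0-9_.-]+?)"
    r"(?:\.git)?/?$"
)


def parse_repo(value: str) -> tuple[str, str]:
    """Return (owner, repo) or raise ValueError for unrecognized input."""
    match = _REPO_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not a GitHub repository reference: {value!r}")
    return match["owner"], match["repo"]
```

Normalizing early means the later steps (loading files, generating embeddings) can key everything on a single owner/repo pair.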
Query Documentation
- Go to the "🤖 AI Documentation Assistant" tab
- Select your repository
- Ask natural language questions
- Get AI responses with source citations
Repository Management
- Use the "⚙️ Repository Management" tab
- View statistics and file counts
- Delete repositories when needed
Configuration
Environment Variables
MongoDB Atlas Setup
Database Structure
The following collections are auto-created:
- doc_rag - Documents with embeddings
- ingested_repos - Repository metadata
Troubleshooting
Rate Limits
Add a GitHub token to raise the API limit to 5,000 requests/hour (vs. 60 without a token).
Memory Issues
Reduce CHUNK_SIZE in .env.
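To illustrate why a smaller CHUNK_SIZE eases memory pressure, the sketch below reads the setting from the environment and splits text into pieces of at most that size. Both helpers are hypothetical: the fallback of 1024 and the choice to measure chunks in characters are assumptions, and Doc-MCP's actual default and unit are not given in the source.

```python
import os


def load_chunk_size(default: int = 1024) -> int:
    """Read CHUNK_SIZE from the environment, falling back to `default`.

    Assumption: missing, non-numeric, or non-positive values fall back.
    """
    raw = os.environ.get("CHUNK_SIZE", "").strip()
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Smaller chunks mean each embedding request carries less text at once, at the cost of producing more chunks per document.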
Connection Errors
Verify MongoDB Atlas Vector Search is enabled:
Advanced Features
RAG Pipeline
Doc-MCP uses LlamaIndex for RAG implementation:
GitHub Integration
Source Code
View the complete source code at: ~/workspace/source/mcp_ai_agents/doc_mcp/
GitHub Repository
View the official repository
MongoDB Atlas
Learn about Vector Search
Related Examples
GitHub MCP Agent
Explore repositories with natural language
Custom MCP Server
Build your own MCP server