Overview

Doc-MCP is a comprehensive documentation RAG system that transforms GitHub repositories into queryable knowledge bases accessible via the Model Context Protocol (MCP). It provides semantic search, AI-powered Q&A, and batch processing capabilities for documentation ingestion.

Features

  • Semantic Search - Find answers across documentation using natural language
  • AI-Powered Q&A - Get intelligent responses with source citations
  • Batch Processing - Ingest entire repositories with progress tracking
  • Incremental Updates - Detect and sync only changed files
  • MCP Server - Expose documentation through Model Context Protocol
  • Repository Management - Complete CRUD operations for ingested docs

Architecture

Doc-MCP ingests markdown documentation from GitHub, generates embeddings with the Nebius API, stores them in MongoDB Atlas, and serves queries through a Gradio web interface and an MCP server.

Prerequisites

  • Python 3.13+
  • MongoDB Atlas with Vector Search enabled
  • Nebius API key for embeddings and LLM
  • GitHub token (optional, for private repos)

Installation

1. Clone Repository

git clone https://github.com/md-abid-hussain/doc-mcp.git
cd doc-mcp

2. Create Virtual Environment

python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# .venv\Scripts\activate   # Windows

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment

Create a .env file:

cp .env.example .env

Add your credentials:

NEBIUS_API_KEY=your_nebius_api_key_here
MONGODB_URI=mongodb+srv://<username>:<password>@<cluster>.mongodb.net/
GITHUB_API_KEY=your_github_token_here  # Optional

5. Setup Database

python scripts/db_setup.py setup

Usage

Starting the Application

python main.py

The application will launch a Gradio interface at http://localhost:7860.

MCP Server Access

The MCP server is automatically exposed at:
http://127.0.0.1:7860/gradio_api/mcp/sse
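Any MCP-capable client can connect to this endpoint over SSE. As an illustrative example (the exact configuration schema varies by client, and the doc-mcp server name is an arbitrary label), a client configuration might look like:

```json
{
  "mcpServers": {
    "doc-mcp": {
      "url": "http://127.0.0.1:7860/gradio_api/mcp/sse"
    }
  }
}
```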

MCP Integration

Gradio MCP Server

Doc-MCP uses Gradio’s built-in MCP server functionality:
class DocMCPApp:
    def launch(self, **kwargs):
        """Launch the Gradio app with the MCP server enabled."""
        demo = self.create_interface()
        return demo.launch(mcp_server=True, **kwargs)

MCP Tools Available

list_available_repos_docs

List all ingested repositories. Returns: list of repositories with names and branches
def list_available_repos_docs(self) -> List[Dict[str, str]]:
    repos = repository_manager.repos_collection.find(
        {}, {"repo_name": 1, "branch": 1, "_id": 0}
    )
    return list(repos)
list_repository_files

List files in a repository with optional filtering. Parameters:
  • repo_name (str): Repository name
  • file_extensions (List[str]): Extensions to filter (default: [".md", ".mdx"])
  • branch (str): Branch name (default: "main")
Returns: List of file paths
def list_repository_files(
    self,
    repo_name: str,
    file_extensions: Optional[List[str]] = None,
    branch: Optional[str] = None,
) -> Union[List[str], Dict[str, str]]:
    if not file_extensions:
        file_extensions = [".md", ".mdx"]
    
    filtered_files, _ = github_client.get_repository_tree(
        repo_url=repo_name, 
        file_extensions=file_extensions, 
        branch=branch
    )
    return filtered_files
get_single_file_content_from_repo

Retrieve the content of a single file. Parameters:
  • repo_name (str): Repository name
  • file_path (str): Path to file
  • branch (str): Branch name (default: "main")
Returns: File content
def get_single_file_content_from_repo(
    self, 
    repo_name: str, 
    file_path: str, 
    branch: Optional[str] = None
) -> Dict[str, str]:
    file, _ = asyncio.run(
        load_files_from_github(
            repo_url=repo_name, 
            file_paths=[file_path], 
            branch=branch
        )
    )
    return {
        "file_path": file_path, 
        "content": file[0].get_content()
    }
Retrieve the content of multiple files. Parameters:
  • repo_name (str): Repository name
  • file_paths (List[str]): List of file paths
  • branch (str): Branch name (default: "main")
Returns: List of file contents
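The implementation for this tool is not shown above; it presumably batches the same loader used for single files. A self-contained sketch of that batching pattern, with a stub fetcher standing in for the real GitHub loader (all names here are illustrative, not taken from the source):

```python
import asyncio
from typing import Dict, List

async def fetch_file(repo_name: str, file_path: str, branch: str) -> Dict[str, str]:
    # Stub standing in for the real load_files_from_github call
    return {"file_path": file_path, "content": f"# contents of {file_path}"}

async def get_files(
    repo_name: str, file_paths: List[str], branch: str = "main"
) -> List[Dict[str, str]]:
    # Fetch all requested files concurrently instead of one at a time
    return await asyncio.gather(
        *(fetch_file(repo_name, path, branch) for path in file_paths)
    )

results = asyncio.run(get_files("owner/repo", ["docs/a.md", "docs/b.md"]))
print([r["file_path"] for r in results])  # → ['docs/a.md', 'docs/b.md']
```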
query_doc

Execute a semantic search query against a repository. Parameters:
  • repo_name (str): Repository name
  • query (str): Natural language query
  • mode (str): Search mode ("default", "text_search", "hybrid")
  • top_k (int): Number of results (default: 5, max: 100)
Returns: Query results with source nodes
def query_doc(
    self,
    repo_name: str,
    query: str,
    mode: Optional[str] = "default",
    top_k: Optional[int] = None,
) -> Dict[str, Any]:
    if not top_k or top_k <= 0:
        top_k = 5
    if top_k > 100:
        top_k = 100
        
    return create_query_retriever(repo_name).make_query(
        query, mode, top_k
    )

Web Interface Usage

Documentation Ingestion

  1. Navigate to the "📥 Documentation Ingestion" tab
  2. Enter GitHub repository URL (e.g., owner/repo)
  3. Select markdown files to process
  4. Execute two-step pipeline:
    • Load files from GitHub
    • Generate embeddings
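The two-step pipeline can be sketched as follows. This is a simplified, self-contained illustration: the real pipeline uses LlamaIndex parsers and the Nebius embedding API, while the embed function below is a toy stand-in.

```python
from typing import Dict, List

CHUNK_SIZE = 3072  # mirrors the CHUNK_SIZE environment setting

def chunk(text: str, size: int = CHUNK_SIZE) -> List[str]:
    # Naive fixed-size character splitter
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(piece: str) -> List[float]:
    # Toy embedding; the real system calls the Nebius embedding model
    return [float(len(piece)), float(sum(map(ord, piece)) % 1000)]

def ingest(files: Dict[str, str]) -> List[Dict]:
    records = []
    for path, content in files.items():
        for i, piece in enumerate(chunk(content)):
            records.append({"file": path, "chunk": i, "embedding": embed(piece)})
    return records

records = ingest({"docs/intro.md": "hello " * 1000})  # 6000 characters
print(len(records))  # → 2 chunks of up to 3072 characters each
```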

Query Documentation

  1. Go to the "🤖 AI Documentation Assistant" tab
  2. Select your repository
  3. Ask natural language questions
  4. Get AI responses with source citations

Repository Management

  1. Use the "⚙️ Repository Management" tab
  2. View statistics and file counts
  3. Delete repositories when needed

Configuration

Environment Variables

# Required
NEBIUS_API_KEY=your_nebius_api_key_here
MONGODB_URI=mongodb+srv://<username>:<password>@<cluster>.mongodb.net/

# Optional
GITHUB_API_KEY=your_github_token_here
CHUNK_SIZE=3072
SIMILARITY_TOP_K=5
GITHUB_CONCURRENT_REQUESTS=10
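A minimal sketch of reading these settings at startup (the load_config helper is illustrative, not part of the codebase; the defaults match the values shown above):

```python
import os

def load_config() -> dict:
    # Required settings: fail fast if either is missing
    required = ["NEBIUS_API_KEY", "MONGODB_URI"]
    missing = [key for key in required if not os.getenv(key)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    return {
        "nebius_api_key": os.environ["NEBIUS_API_KEY"],
        "mongodb_uri": os.environ["MONGODB_URI"],
        # Optional settings fall back to the documented defaults
        "github_api_key": os.getenv("GITHUB_API_KEY"),
        "chunk_size": int(os.getenv("CHUNK_SIZE", "3072")),
        "similarity_top_k": int(os.getenv("SIMILARITY_TOP_K", "5")),
        "github_concurrent_requests": int(os.getenv("GITHUB_CONCURRENT_REQUESTS", "10")),
    }
```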

MongoDB Atlas Setup

1. Create Cluster

Create a MongoDB Atlas cluster with Vector Search enabled.

2. Database Structure

The following collections are auto-created:
  • doc_rag - Documents with embeddings
  • ingested_repos - Repository metadata

3. Vector Search Index

Ensure Vector Search is properly configured for semantic search.
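As a reference point, an Atlas Vector Search index definition for this collection might look like the following. The embedding field path and numDimensions depend on the embedding model in use, so treat these values as placeholders:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 4096,
      "similarity": "cosine"
    }
  ]
}
```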

Troubleshooting

GitHub rate limiting: add a GitHub token to raise the limit to 5,000 requests/hour (vs. 60 unauthenticated):
GITHUB_API_KEY=your_github_token_here

Memory or embedding errors during ingestion: reduce CHUNK_SIZE in .env:
CHUNK_SIZE=2048

Semantic search failures: verify MongoDB Atlas Vector Search is enabled:
python scripts/db_setup.py status

Advanced Features

RAG Pipeline

Doc-MCP uses LlamaIndex for RAG implementation:
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

# Vector store configuration
vector_store = MongoDBAtlasVectorSearch(
    mongodb_client=client,
    db_name="doc_rag",
    collection_name="documents",
    index_name="vector_index",
)

# Build an index over the existing vector store
index = VectorStoreIndex.from_vector_store(vector_store)

# Query with semantic search ("default", "text_search", or "hybrid")
query_engine = index.as_query_engine(
    similarity_top_k=top_k,
    vector_store_query_mode=mode,
)

GitHub Integration

from src.github.client import github_client
from src.github.file_loader import load_files_from_github

# Load files from repository
files, stats = await load_files_from_github(
    repo_url="owner/repo",
    file_paths=["docs/intro.md"],
    branch="main"
)

GitHub Repository

View the official repository

MongoDB Atlas

Learn about Vector Search

GitHub MCP Agent

Explore repositories with natural language

Custom MCP Server

Build your own MCP server
