Overview

Doc-MCP is a comprehensive documentation RAG system that transforms GitHub repositories into queryable knowledge bases accessible via the Model Context Protocol (MCP). It provides semantic search, AI-powered Q&A, and batch processing capabilities for documentation ingestion.

Features

  • Semantic Search - Find answers across documentation using natural language
  • AI-Powered Q&A - Get intelligent responses with source citations
  • Batch Processing - Ingest entire repositories with progress tracking
  • Incremental Updates - Detect and sync only changed files
  • MCP Server - Expose documentation through Model Context Protocol
  • Repository Management - Complete CRUD operations for ingested docs

Architecture

Doc-MCP ingests markdown documentation from GitHub, generates embeddings with the Nebius API, stores them in MongoDB Atlas, and serves queries through a Gradio web interface and an MCP server.

Prerequisites

  • Python 3.13+
  • MongoDB Atlas with Vector Search enabled
  • Nebius API key for embeddings and LLM
  • GitHub token (optional, for private repos)

Installation

1. Clone Repository

git clone https://github.com/md-abid-hussain/doc-mcp.git
cd doc-mcp

2. Create Virtual Environment

python -m venv .venv
source .venv/bin/activate  # Linux/Mac
# .venv\Scripts\activate   # Windows

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment

Create a .env file:

cp .env.example .env

Add your credentials:

NEBIUS_API_KEY=your_nebius_api_key_here
MONGODB_URI=mongodb+srv://<username>:<password>@<cluster>.mongodb.net/
GITHUB_API_KEY=your_github_token_here  # Optional

5. Setup Database

python scripts/db_setup.py setup

Usage

Starting the Application

python main.py

The application will launch a Gradio interface at http://localhost:7860.

MCP Server Access

The MCP server is automatically exposed at:
http://127.0.0.1:7860/gradio_api/mcp/sse
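Any MCP-capable client can connect to this endpoint over SSE. As an illustrative example (the exact configuration schema varies by client, and the doc-mcp server name is an arbitrary label), a client configuration might look like:

```json
{
  "mcpServers": {
    "doc-mcp": {
      "url": "http://127.0.0.1:7860/gradio_api/mcp/sse"
    }
  }
}
```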

MCP Integration

Gradio MCP Server

Doc-MCP uses Gradio’s built-in MCP server functionality:
class DocMCPApp:
    def launch(self, **kwargs):
        """Launch the Gradio app with the MCP server enabled."""
        demo = self.create_interface()
        return demo.launch(mcp_server=True, **kwargs)

MCP Tools Available

list_available_repos_docs

List all ingested repositories. Returns: list of repositories with names and branches
def list_available_repos_docs(self) -> List[Dict[str, str]]:
    repos = repository_manager.repos_collection.find(
        {}, {"repo_name": 1, "branch": 1, "_id": 0}
    )
    return list(repos)
list_repository_files

List files in a repository with optional filtering. Parameters:
  • repo_name (str): Repository name
  • file_extensions (List[str]): Extensions to filter (default: [".md", ".mdx"])
  • branch (str): Branch name (default: "main")
Returns: List of file paths
def list_repository_files(
    self,
    repo_name: str,
    file_extensions: Optional[List[str]] = None,
    branch: Optional[str] = None,
) -> Union[List[str], Dict[str, str]]:
    if not file_extensions:
        file_extensions = [".md", ".mdx"]
    
    filtered_files, _ = github_client.get_repository_tree(
        repo_url=repo_name, 
        file_extensions=file_extensions, 
        branch=branch
    )
    return filtered_files
get_single_file_content_from_repo

Retrieve the content of a single file. Parameters:
  • repo_name (str): Repository name
  • file_path (str): Path to file
  • branch (str): Branch name (default: "main")
Returns: File content
def get_single_file_content_from_repo(
    self, 
    repo_name: str, 
    file_path: str, 
    branch: Optional[str] = None
) -> Dict[str, str]:
    file, _ = asyncio.run(
        load_files_from_github(
            repo_url=repo_name, 
            file_paths=[file_path], 
            branch=branch
        )
    )
    return {
        "file_path": file_path, 
        "content": file[0].get_content()
    }
Retrieve the content of multiple files. Parameters:
  • repo_name (str): Repository name
  • file_paths (List[str]): List of file paths
  • branch (str): Branch name (default: "main")
Returns: List of file contents
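The implementation for this tool is not shown above; it presumably batches the same loader used for single files. A self-contained sketch of that batching pattern, with a stub fetcher standing in for the real GitHub loader (all names here are illustrative, not taken from the source):

```python
import asyncio
from typing import Dict, List

async def fetch_file(repo_name: str, file_path: str, branch: str) -> Dict[str, str]:
    # Stub standing in for the real load_files_from_github call
    return {"file_path": file_path, "content": f"# contents of {file_path}"}

async def get_files(
    repo_name: str, file_paths: List[str], branch: str = "main"
) -> List[Dict[str, str]]:
    # Fetch all requested files concurrently instead of one at a time
    return await asyncio.gather(
        *(fetch_file(repo_name, path, branch) for path in file_paths)
    )

results = asyncio.run(get_files("owner/repo", ["docs/a.md", "docs/b.md"]))
print([r["file_path"] for r in results])  # → ['docs/a.md', 'docs/b.md']
```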
query_doc

Execute a semantic search query against a repository. Parameters:
  • repo_name (str): Repository name
  • query (str): Natural language query
  • mode (str): Search mode ("default", "text_search", "hybrid")
  • top_k (int): Number of results (default: 5, max: 100)
Returns: Query results with source nodes
def query_doc(
    self,
    repo_name: str,
    query: str,
    mode: Optional[str] = "default",
    top_k: Optional[int] = None,
) -> Dict[str, Any]:
    if not top_k or top_k <= 0:
        top_k = 5
    if top_k > 100:
        top_k = 100
        
    return create_query_retriever(repo_name).make_query(
        query, mode, top_k
    )

Web Interface Usage

Documentation Ingestion

  1. Navigate to the "📥 Documentation Ingestion" tab
  2. Enter GitHub repository URL (e.g., owner/repo)
  3. Select markdown files to process
  4. Execute two-step pipeline:
    • Load files from GitHub
    • Generate embeddings
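The two-step pipeline can be sketched as follows. This is a simplified, self-contained illustration: the real pipeline uses LlamaIndex parsers and the Nebius embedding API, while the embed function below is a toy stand-in.

```python
from typing import Dict, List

CHUNK_SIZE = 3072  # mirrors the CHUNK_SIZE environment setting

def chunk(text: str, size: int = CHUNK_SIZE) -> List[str]:
    # Naive fixed-size character splitter
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(piece: str) -> List[float]:
    # Toy embedding; the real system calls the Nebius embedding model
    return [float(len(piece)), float(sum(map(ord, piece)) % 1000)]

def ingest(files: Dict[str, str]) -> List[Dict]:
    records = []
    for path, content in files.items():
        for i, piece in enumerate(chunk(content)):
            records.append({"file": path, "chunk": i, "embedding": embed(piece)})
    return records

records = ingest({"docs/intro.md": "hello " * 1000})  # 6000 characters
print(len(records))  # → 2 chunks of up to 3072 characters each
```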

Query Documentation

  1. Go to the "🤖 AI Documentation Assistant" tab
  2. Select your repository
  3. Ask natural language questions
  4. Get AI responses with source citations

Repository Management

  1. Use the "⚙️ Repository Management" tab
  2. View statistics and file counts
  3. Delete repositories when needed

Configuration

Environment Variables

# Required
NEBIUS_API_KEY=your_nebius_api_key_here
MONGODB_URI=mongodb+srv://<username>:<password>@<cluster>.mongodb.net/

# Optional
GITHUB_API_KEY=your_github_token_here
CHUNK_SIZE=3072
SIMILARITY_TOP_K=5
GITHUB_CONCURRENT_REQUESTS=10
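A minimal sketch of reading these settings at startup (the load_config helper is illustrative, not part of the codebase; the defaults match the values shown above):

```python
import os

def load_config() -> dict:
    # Required settings: fail fast if either is missing
    required = ["NEBIUS_API_KEY", "MONGODB_URI"]
    missing = [key for key in required if not os.getenv(key)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    return {
        "nebius_api_key": os.environ["NEBIUS_API_KEY"],
        "mongodb_uri": os.environ["MONGODB_URI"],
        # Optional settings fall back to the documented defaults
        "github_api_key": os.getenv("GITHUB_API_KEY"),
        "chunk_size": int(os.getenv("CHUNK_SIZE", "3072")),
        "similarity_top_k": int(os.getenv("SIMILARITY_TOP_K", "5")),
        "github_concurrent_requests": int(os.getenv("GITHUB_CONCURRENT_REQUESTS", "10")),
    }
```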

MongoDB Atlas Setup

1. Create Cluster

Create a MongoDB Atlas cluster with Vector Search enabled.

2. Database Structure

The following collections are auto-created:
  • doc_rag - Documents with embeddings
  • ingested_repos - Repository metadata

3. Vector Search Index

Ensure Vector Search is properly configured for semantic search.
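As a reference point, an Atlas Vector Search index definition for this collection might look like the following. The embedding field path and numDimensions depend on the embedding model in use, so treat these values as placeholders:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 4096,
      "similarity": "cosine"
    }
  ]
}
```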

Troubleshooting

GitHub rate limiting: add a GitHub token to raise the limit to 5,000 requests/hour (vs. 60 unauthenticated):
GITHUB_API_KEY=your_github_token_here

Memory or embedding errors during ingestion: reduce CHUNK_SIZE in .env:
CHUNK_SIZE=2048

Semantic search failures: verify MongoDB Atlas Vector Search is enabled:
python scripts/db_setup.py status

Advanced Features

RAG Pipeline

Doc-MCP uses LlamaIndex for RAG implementation:
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

# Vector store configuration
vector_store = MongoDBAtlasVectorSearch(
    mongodb_client=client,
    db_name="doc_rag",
    collection_name="documents",
    index_name="vector_index",
)

# Build an index over the existing vector store
index = VectorStoreIndex.from_vector_store(vector_store)

# Query with semantic search ("default", "text_search", or "hybrid")
query_engine = index.as_query_engine(
    similarity_top_k=top_k,
    vector_store_query_mode=mode,
)

GitHub Integration

from src.github.client import github_client
from src.github.file_loader import load_files_from_github

# Load files from repository
files, stats = await load_files_from_github(
    repo_url="owner/repo",
    file_paths=["docs/intro.md"],
    branch="main"
)

GitHub Repository

View the official repository

MongoDB Atlas

Learn about Vector Search

GitHub MCP Agent

Explore repositories with natural language

Custom MCP Server

Build your own MCP server
