Fetch Tool

The fetch tool retrieves complete document content from WebHelp documentation sites and converts it from HTML to clean Markdown. This enables AI assistants to analyze, cite, and answer questions with full context.

How It Works

When you invoke the fetch tool with a document ID, the server:

1. Resolves the document URL from the ID (format: index:path)
2. Downloads the HTML content from the WebHelp site
3. Extracts the main <article> element to remove navigation and UI
4. Converts the HTML to Markdown using Turndown
5. Returns the structured content with ID, title, text, and URL

Always use the id field from search results rather than constructing IDs manually.

Implementation

Here’s the complete fetch implementation from the source code:

// From webhelp-search-client.ts:190-218
async fetchDocumentContent(documentId: string): Promise<{
  id: string;
  title: string;
  text: string;
  url: string;
  metadata?: any;
}> {
  const fullUrl = this.resolveDocumentUrl(documentId);
  let htmlContent = await downloadFile(fullUrl);
  
  // Extract just the article element
  let articleContent = this.extractArticleElement(htmlContent);
  
  // Convert HTML to markdown
  const turndownService = new TurndownService({
    headingStyle: 'atx',
    codeBlockStyle: 'fenced',
    bulletListMarker: '-'
  });
  
  const markdownContent = turndownService.turndown(articleContent);
  
  return {
    id: documentId,
    title: extractTitleFromContent(htmlContent) || documentId,
    text: markdownContent,
    url: fullUrl
  };
}

Document ID Resolution

Document IDs use a composite format to support federated search:

// From webhelp-search-client.ts:220-228
private resolveDocumentUrl(documentId: string): string {
  const [indexStr, ...pathParts] = documentId.split(':');
  const baseUrl = this.baseUrls[Number(indexStr)];
  if (!baseUrl) {
    throw new Error(`Unknown base URL index: ${indexStr}`);
  }
  const path = pathParts.join(':');
  return `${baseUrl}${path}`;
}

ID Format: index:path

index — Zero-based index into the baseUrls array
path — Relative path to the document

Examples:

0:topics/introduction.html — First site, introduction topic
1:reference/api.html — Second site, API reference
0:topics/config/advanced.html — Path with multiple segments

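Only the first colon separates the index from the path, so paths themselves can contain colons. The implementation splits on every colon and rejoins the remaining parts with a colon, which has the same effect as splitting on the first colon only.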
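To make the colon handling concrete, here is a minimal standalone sketch of the same resolution logic. It is not the project's code: it takes baseUrls as a parameter instead of reading it from a class, it assumes each base URL ends with a slash, and the explicit check for a missing colon is an addition for illustration.

```typescript
// Hypothetical standalone variant of the resolver shown above.
// Assumption: every entry in baseUrls ends with "/".
function resolveDocumentUrl(baseUrls: string[], documentId: string): string {
  // Cut at the first colon only, so later colons stay in the path.
  const separator = documentId.indexOf(":");
  if (separator === -1) {
    // Added for illustration; the original has no such check.
    throw new Error(`Malformed document ID (expected index:path): ${documentId}`);
  }
  const indexStr = documentId.slice(0, separator);
  const path = documentId.slice(separator + 1);

  const baseUrl = baseUrls[Number(indexStr)];
  if (!baseUrl) {
    throw new Error(`Unknown base URL index: ${indexStr}`);
  }
  return `${baseUrl}${path}`;
}

const baseUrls = ["https://example.com/docs/", "https://example.org/help/"];

console.log(resolveDocumentUrl(baseUrls, "0:topics/introduction.html"));
// prints "https://example.com/docs/topics/introduction.html"
console.log(resolveDocumentUrl(baseUrls, "1:reference/a:b.html"));
// prints "https://example.org/help/reference/a:b.html"
```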

Article Extraction

The fetch tool extracts only the main content to avoid sending navigation, headers, and footers to the AI:

// From webhelp-search-client.ts:230-241
extractArticleElement(htmlContent: string): string {
  const dom = new JSDOM(htmlContent);
  const document = dom.window.document;
  
  const articleElement = document.querySelector('article');
  if (articleElement) {
    return articleElement.outerHTML;
  }
  
  // If no article element found, return the original content
  return htmlContent;
}

If no <article> element exists, the entire HTML is converted. This may include navigation and other UI elements.
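Because the fallback silently returns the whole page, a caller may want to know when it happened. The sketch below is a dependency-free approximation that uses regular expressions rather than JSDOM, so it is not the server's method; the name extractArticleApprox and the usedFallback flag are hypothetical. Unlike querySelector, it stops at the first closing tag, so nested <article> elements would be cut short.

```typescript
interface ExtractionResult {
  html: string;
  usedFallback: boolean; // true when no <article> element was found
}

// Approximate, regex-based stand-in for the JSDOM extraction shown above.
function extractArticleApprox(htmlContent: string): ExtractionResult {
  const fallback: ExtractionResult = { html: htmlContent, usedFallback: true };

  // Find the first opening <article> tag (with or without attributes).
  const openMatch = /<article[\s>]/i.exec(htmlContent);
  if (!openMatch) return fallback;

  // Find the first closing tag after the opening tag.
  const closePattern = /<\/article\s*>/gi;
  closePattern.lastIndex = openMatch.index;
  const closeMatch = closePattern.exec(htmlContent);
  if (!closeMatch) return fallback;

  const end = closeMatch.index + closeMatch[0].length;
  return { html: htmlContent.slice(openMatch.index, end), usedFallback: false };
}

const samplePage =
  '<html><body><nav>Menu</nav><article class="topic"><h1>Title</h1></article></body></html>';
console.log(extractArticleApprox(samplePage).html);
// prints '<article class="topic"><h1>Title</h1></article>'
```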

HTML to Markdown Conversion

The server uses Turndown to convert HTML to Markdown:

const turndownService = new TurndownService({
  headingStyle: 'atx',      // Use # syntax for headings
  codeBlockStyle: 'fenced',  // Use ``` for code blocks
  bulletListMarker: '-'      // Use - for unordered lists
});

const markdownContent = turndownService.turndown(articleContent);

Conversion features:

ATX-style headings (#, ##, ###)
Fenced code blocks, with the language taken from a language-* class on the code element when present
Preserved code formatting and syntax
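Tables converted to Markdown tables only if Turndown's GFM plugin is added (the configuration shown above does not enable it, so table cells are emitted as plain text)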
Links and images preserved
Lists properly formatted

MCP Tool Definition

Here’s how the fetch tool is exposed via the Model Context Protocol:

// From app/[...site]/route.ts:85-112
server.tool(
  "fetch",
  "Retrieve complete document content by ID for detailed analysis and citation. Use this after finding relevant documents with the search tool.",
  {
    id: z.string().describe("Document ID from search results")
  },
  async ({ id }) => {
    console.log('Tool "fetch" invoked with params:', { id });
    try {
      const fetchResult = await searchClient.fetchDocumentContent(id);

      return {
        content: [{
          type: "text",
          text: JSON.stringify(fetchResult)
        }]
      };
    } catch (error: any) {
      return {
        content: [{
          type: "text",
          text: `Fetch failed: ${error.message}`
        }],
        isError: true
      };
    }
  }
);

Parameters

id

string

required

Document ID from search results. Format: index:path where index is the site index and path is the relative document path.
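For clients that talk to the server directly, the parameter travels inside a standard MCP tools/call request. The sketch below only builds that request object; the helper name buildFetchCall is hypothetical, and how the request is transported depends on your MCP client.

```typescript
// Shape of a JSON-RPC 2.0 request for calling an MCP tool.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: wraps a document ID in a call to the "fetch" tool.
function buildFetchCall(requestId: number, documentId: string): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: requestId,
    method: "tools/call",
    params: { name: "fetch", arguments: { id: documentId } },
  };
}

console.log(JSON.stringify(buildFetchCall(1, "0:topics/getting-started.html"), null, 2));
```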

Return Value

The fetch tool returns a JSON object with the document content:

{
  "id": "0:topics/getting-started.html",
  "title": "Getting Started with WebHelp",
  "text": "# Getting Started with WebHelp\n\nWebHelp is a...\n\n## Installation\n\n...",
  "url": "https://example.com/docs/topics/getting-started.html"
}

Result Fields

id

string

The document ID that was requested

title

string

Document title extracted from the HTML <title> tag or page metadata. Falls back to the document ID when no title can be extracted.

text

string

Complete document content converted to Markdown format

url

string

Full URL to the original HTML document

metadata

object

Additional metadata if available (currently unused)
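Since the tool definition serializes the result with JSON.stringify into a single text item, a client has to parse that text to get at the fields. A minimal sketch of that step follows, assuming the result shape shown in the tool definition; the type and function names are illustrative, not part of the project.

```typescript
interface FetchedDocument {
  id: string;
  title: string;
  text: string;
  url: string;
  metadata?: unknown;
}

interface ToolResult {
  content: { type: string; text: string }[];
  isError?: boolean;
}

// Hypothetical client-side helper: unwraps and validates a fetch tool result.
function parseFetchResult(result: ToolResult): FetchedDocument {
  const first = result.content[0];
  if (!first || first.type !== "text") {
    throw new Error("Expected a text content item");
  }
  // On failure the text holds the error message, not JSON.
  if (result.isError) {
    throw new Error(first.text);
  }
  const parsed: unknown = JSON.parse(first.text);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Fetch result is not a JSON object");
  }
  const doc = parsed as Record<string, unknown>;
  for (const field of ["id", "title", "text", "url"]) {
    if (typeof doc[field] !== "string") {
      throw new Error(`Missing or non-string field: ${field}`);
    }
  }
  return doc as unknown as FetchedDocument;
}

const sampleResult: ToolResult = {
  content: [{
    type: "text",
    text: JSON.stringify({
      id: "0:topics/getting-started.html",
      title: "Getting Started with WebHelp",
      text: "# Getting Started with WebHelp",
      url: "https://example.com/docs/topics/getting-started.html",
    }),
  }],
};
console.log(parseFetchResult(sampleResult).title);
// prints "Getting Started with WebHelp"
```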

Usage Examples

Claude Desktop Workflow

User: Find information about DITA map validation

Claude: [Searches and finds results]
Found 5 relevant documents. Let me fetch the most relevant one.

Claude: [Fetches document 0:topics/validation.html]

Based on the documentation, DITA map validation checks for:
1. Valid topic references
2. Consistent metadata
3. Proper hierarchy structure...

Fetching DITA-OT Documentation

// MCP configuration
{
  "mcpServers": {
    "dita-ot": {
      "url": "https://webhelp-mcp.vercel.app/www.dita-ot.org/dev"
    }
  }
}

Search query: “transformation types”
Fetch ID: 0:topics/output-formats.html
Result: Complete guide to DITA-OT output formats in Markdown

Fetching Oxygen XML Documentation

// MCP configuration
{
  "mcpServers": {
    "oxygen": {
      "url": "https://webhelp-mcp.vercel.app/www.oxygenxml.com/doc/versions/26.1/ug-editor"
    }
  }
}

Search query: “content completion”
Fetch ID: 0:topics/streamline-with-content-completion.html
Result: Full content completion documentation with examples

Error Handling

Invalid Document ID

{
  "error": "Unknown base URL index: 5"
}

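This error occurs when the index in the ID doesn’t correspond to any configured base URL. Always use IDs from search results. As with any fetch failure, the tool definition above delivers the message to the client as a text item prefixed with "Fetch failed:" and with isError set to true.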

Document Not Found

{
  "error": "Failed to download file: HTTP 404"
}

Common causes:

Document was moved or deleted
Search index is outdated
Incorrect document path
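A client can tell the two failures above apart by inspecting the error message. The sketch below matches only the two message formats shown in this section; the function name and the failure categories are hypothetical, and other messages fall through to a generic bucket.

```typescript
type FetchFailureKind = "unknown-index" | "http-error" | "other";

interface FetchFailure {
  kind: FetchFailureKind;
  httpStatus?: number; // set only for "http-error"
}

// Hypothetical helper: maps a fetch error message to a coarse category.
function classifyFetchFailure(message: string): FetchFailure {
  if (message.includes("Unknown base URL index")) {
    // The ID did not come from this server's search results.
    return { kind: "unknown-index" };
  }
  const http = /HTTP (\d{3})/.exec(message);
  if (http) {
    // For example 404: the document moved, or the search index is stale.
    return { kind: "http-error", httpStatus: Number(http[1]) };
  }
  return { kind: "other" };
}

console.log(classifyFetchFailure("Fetch failed: Failed to download file: HTTP 404"));
// prints { kind: 'http-error', httpStatus: 404 }
```

An unknown-index failure usually means the search should be rerun, while an HTTP 404 suggests trying a different search result.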

Conversion Failures

If Turndown encounters problematic HTML, it may produce malformed Markdown. The server doesn’t validate output quality.

If Markdown output looks broken, try viewing the original HTML at the returned URL.

Performance Considerations

Download Time

Each fetch downloads the complete HTML page from the WebHelp site:

let htmlContent = await downloadFile(fullUrl);

Typical page size: 10-100 KB
Download time: 100-500ms depending on network and server
No caching: each fetch downloads fresh content
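If repeated fetches of the same document become a bottleneck, a client or a self-hosted deployment could add a small cache in front of the download. The following is a minimal in-memory sketch under that assumption, not something the server does today; withCache, the TTL parameter, and fakeDownload are all hypothetical.

```typescript
type Loader<T> = (id: string) => Promise<T>;

// Wraps a loader so that repeated calls with the same ID reuse one promise
// until the time-to-live expires. The clock is injectable for testing.
function withCache<T>(
  load: Loader<T>,
  ttlMs: number,
  now: () => number = Date.now,
): Loader<T> {
  const entries = new Map<string, { expires: number; value: Promise<T> }>();
  return (id) => {
    const hit = entries.get(id);
    if (hit && hit.expires > now()) {
      return hit.value;
    }
    const value = load(id);
    entries.set(id, { expires: now() + ttlMs, value });
    // Drop failed loads so the next call retries instead of replaying the error.
    value.catch(() => {
      if (entries.get(id)?.value === value) entries.delete(id);
    });
    return value;
  };
}

// Stand-in for a real download; counts how often it is invoked.
let downloadCount = 0;
const fakeDownload: Loader<string> = async (id) => {
  downloadCount += 1;
  return `<article>${id}</article>`;
};

const cachedDownload = withCache(fakeDownload, 60_000);
void cachedDownload("0:topics/introduction.html");
void cachedDownload("0:topics/introduction.html"); // served from the cache
console.log(downloadCount); // prints 1
```

Caching the promise rather than the resolved value means concurrent requests for the same ID share a single download.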

Conversion Time

HTML to Markdown conversion is fast but depends on document size:

Small documents (< 50 KB): < 50ms
Medium documents (50-200 KB): 50-200ms
Large documents (> 200 KB): 200ms-1s

Each fetch call completes both the download and the conversion before it returns a result. Large documents may cause timeouts in some AI tools.

Best Practices

Search First

Always search before fetching to find the right document IDs

Use Exact IDs

Never construct IDs manually — always use search results

Fetch Selectively

Only fetch documents you need — searches are much faster

Check URLs

Include the original URL in citations for user reference
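As one way to follow the citation advice, a client could turn the title and url fields of a fetch result into a Markdown link. This is only an illustrative sketch; formatCitation is a hypothetical helper, and the bracket escaping is an added precaution.

```typescript
// Hypothetical helper: builds a Markdown citation link from a fetch result.
function formatCitation(doc: { title: string; url: string }): string {
  // Escape square brackets so the title cannot break the link syntax.
  const safeTitle = doc.title.replace(/[\[\]]/g, "\\$&");
  return `[${safeTitle}](${doc.url})`;
}

console.log(
  formatCitation({
    title: "Getting Started with WebHelp",
    url: "https://example.com/docs/topics/getting-started.html",
  }),
);
// prints "[Getting Started with WebHelp](https://example.com/docs/topics/getting-started.html)"
```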

Markdown Quality

The conversion quality depends on the WebHelp HTML structure.

Well-Converted Elements:

Headings and paragraphs
Code blocks with syntax highlighting
Tables and lists
Links and images
Bold and italic text

Potentially Problematic:

Custom WebHelp widgets
JavaScript-rendered content
Complex CSS layouts
Embedded multimedia

Most Oxygen WebHelp content converts cleanly because it uses semantic HTML generated from DITA.

Next Steps

Search Tool

Learn how to find documents to fetch

Federated Search

Fetch from multiple sites

Integration Guide

Set up with Claude Desktop

Deploy Your Own

Host a private instance
