
Fetch Tool

The fetch tool retrieves complete document content from WebHelp documentation sites and converts it from HTML to clean Markdown. This enables AI assistants to analyze, cite, and answer questions with full context.

How It Works

When you invoke the fetch tool with a document ID, the server:
  1. Resolves the document URL from the ID (format: index:path)
  2. Downloads the HTML content from the WebHelp site
  3. Extracts the main <article> element to remove navigation and UI
  4. Converts HTML to Markdown using Turndown
  5. Returns the structured content with title, text, and metadata
Always use the id field from search results rather than constructing IDs manually.

Implementation

Here’s the complete fetch implementation from the source code:
// From webhelp-search-client.ts:190-218
async fetchDocumentContent(documentId: string): Promise<{
  id: string;
  title: string;
  text: string;
  url: string;
  metadata?: any;
}> {
  const fullUrl = this.resolveDocumentUrl(documentId);
  let htmlContent = await downloadFile(fullUrl);
  
  // Extract just the article element
  let articleContent = this.extractArticleElement(htmlContent);
  
  // Convert HTML to markdown
  const turndownService = new TurndownService({
    headingStyle: 'atx',
    codeBlockStyle: 'fenced',
    bulletListMarker: '-'
  });
  
  const markdownContent = turndownService.turndown(articleContent);
  
  return {
    id: documentId,
    title: extractTitleFromContent(htmlContent) || documentId,
    text: markdownContent,
    url: fullUrl
  };
}

Document ID Resolution

Document IDs use a composite format to support federated search:
// From webhelp-search-client.ts:220-228
private resolveDocumentUrl(documentId: string): string {
  const [indexStr, ...pathParts] = documentId.split(':');
  const baseUrl = this.baseUrls[Number(indexStr)];
  if (!baseUrl) {
    throw new Error(`Unknown base URL index: ${indexStr}`);
  }
  const path = pathParts.join(':');
  return `${baseUrl}${path}`;
}
ID Format: index:path
  • index — Zero-based index into the baseUrls array
  • path — Relative path to the document
Examples:
  • 0:topics/introduction.html — First site, introduction topic
  • 1:reference/api.html — Second site, API reference
  • 0:topics/config/advanced.html — Path with multiple segments
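Paths may themselves contain colons. The implementation splits the ID on every colon, treats the first segment as the index, and rejoins the remaining segments with colons, which has the same effect as splitting on the first colon only.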
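The splitting behavior described above can be checked in isolation. The sketch below mirrors the logic of resolveDocumentUrl as standalone functions. The names parseDocumentId and resolveUrl and the sample base URLs are hypothetical and exist only for illustration. It also assumes that each base URL ends with a trailing slash, since the path is appended without a separator.

```typescript
// Hypothetical standalone helpers mirroring resolveDocumentUrl above.
interface ParsedDocumentId {
  indexStr: string;
  index: number;
  path: string;
}

function parseDocumentId(documentId: string): ParsedDocumentId {
  // Split on every colon, then rejoin everything after the first segment,
  // so colons inside the path survive unchanged.
  const [indexStr, ...pathParts] = documentId.split(':');
  return { indexStr, index: Number(indexStr), path: pathParts.join(':') };
}

function resolveUrl(baseUrls: string[], documentId: string): string {
  const { indexStr, index, path } = parseDocumentId(documentId);
  const baseUrl = baseUrls[index];
  if (!baseUrl) {
    throw new Error(`Unknown base URL index: ${indexStr}`);
  }
  return `${baseUrl}${path}`;
}

// Sample data (assumed): each base URL ends with a trailing slash.
const sampleBaseUrls = ['https://example.com/docs/', 'https://example.org/help/'];
console.log(resolveUrl(sampleBaseUrls, '0:topics/a:b.html'));
```

An ID whose index has no matching entry in the array, such as 5:topics/intro.html with the two sample sites, throws the "Unknown base URL index" error.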

Article Extraction

The fetch tool extracts only the main content to avoid sending navigation, headers, and footers to the AI:
// From webhelp-search-client.ts:230-241
extractArticleElement(htmlContent: string): string {
  const dom = new JSDOM(htmlContent);
  const document = dom.window.document;
  
  const articleElement = document.querySelector('article');
  if (articleElement) {
    return articleElement.outerHTML;
  }
  
  // If no article element found, return the original content
  return htmlContent;
}
If no <article> element exists, the entire HTML is converted. This may include navigation and other UI elements.

HTML to Markdown Conversion

The server uses Turndown to convert HTML to Markdown:
const turndownService = new TurndownService({
  headingStyle: 'atx',      // Use # syntax for headings
  codeBlockStyle: 'fenced',  // Use ``` for code blocks
  bulletListMarker: '-'      // Use - for unordered lists
});

const markdownContent = turndownService.turndown(articleContent);
Conversion features:
  • ATX-style headings (#, ##, ###)
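  • Fenced code blocks, with the language taken from a language-* class on the code element when one is present
  • Preserved code formatting and syntax
  • Tables converted to Markdown tables only when Turndown's GFM plugin (turndown-plugin-gfm) is enabled; the options shown here do not enable it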
  • Links and images preserved
  • Lists properly formatted

MCP Tool Definition

Here’s how the fetch tool is exposed via the Model Context Protocol:
// From app/[...site]/route.ts:85-112
server.tool(
  "fetch",
  "Retrieve complete document content by ID for detailed analysis and citation. Use this after finding relevant documents with the search tool.",
  {
    id: z.string().describe("Document ID from search results")
  },
  async ({ id }) => {
    console.log('Tool "fetch" invoked with params:', { id });
    try {
      const fetchResult = await searchClient.fetchDocumentContent(id);

      return {
        content: [{
          type: "text",
          text: JSON.stringify(fetchResult)
        }]
      };
    } catch (error: any) {
      return {
        content: [{
          type: "text",
          text: `Fetch failed: ${error.message}`
        }],
        isError: true
      };
    }
  }
);

Parameters

id
string
required
Document ID from search results. Format: index:path where index is the site index and path is the relative document path.

Return Value

The fetch tool returns the document as a JSON object, serialized into the text field of a single MCP text content item:
{
  "id": "0:topics/getting-started.html",
  "title": "Getting Started with WebHelp",
  "text": "# Getting Started with WebHelp\n\nWebHelp is a...\n\n## Installation\n\n...",
  "url": "https://example.com/docs/topics/getting-started.html"
}

Result Fields

id
string
The document ID that was requested
title
string
Document title extracted from the HTML <title> tag or page metadata; falls back to the document ID when no title is found
text
string
Complete document content converted to Markdown format
url
string
Full URL to the original HTML document
metadata
object
Optional additional metadata (currently unused; the implementation shown above does not populate it)
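Because the handler wraps the document in JSON.stringify, a client has to parse the text item before it can read these fields. The following is a minimal sketch of that step. The helper unpackFetchResult and the ToolResult interface are hypothetical; they cover only the fields used on this page and are not part of the server or of any MCP SDK.

```typescript
// Shape of the document returned by fetchDocumentContent (from this page).
interface FetchResult {
  id: string;
  title: string;
  text: string;
  url: string;
  metadata?: unknown;
}

// Minimal shape of an MCP tool result, limited to the fields used here.
interface ToolResult {
  content: { type: string; text?: string }[];
  isError?: boolean;
}

// Hypothetical client-side helper.
function unpackFetchResult(result: ToolResult): FetchResult {
  const first = result.content.find((item) => item.type === 'text');
  const text = first?.text ?? '';
  if (result.isError) {
    // On failure the handler returns a plain message, not JSON.
    throw new Error(text || 'Fetch failed with no message');
  }
  return JSON.parse(text) as FetchResult;
}

const sampleResult: ToolResult = {
  content: [{
    type: 'text',
    text: JSON.stringify({
      id: '0:topics/getting-started.html',
      title: 'Getting Started with WebHelp',
      text: '# Getting Started with WebHelp',
      url: 'https://example.com/docs/topics/getting-started.html',
    }),
  }],
};
console.log(unpackFetchResult(sampleResult).title);
```

Checking isError before parsing matters because a failed fetch carries a human-readable message rather than a JSON document.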

Usage Examples

Claude Desktop Workflow

User: Find information about DITA map validation

Claude: [Searches and finds results]
Found 5 relevant documents. Let me fetch the most relevant one.

Claude: [Fetches document 0:topics/validation.html]

Based on the documentation, DITA map validation checks for:
1. Valid topic references
2. Consistent metadata
3. Proper hierarchy structure...

Fetching DITA-OT Documentation

// MCP configuration
{
  "mcpServers": {
    "dita-ot": {
      "url": "https://webhelp-mcp.vercel.app/www.dita-ot.org/dev"
    }
  }
}
Search query: “transformation types”
Fetch ID: 0:topics/output-formats.html
Result: Complete guide to DITA-OT output formats in Markdown

Fetching Oxygen XML Documentation

// MCP configuration
{
  "mcpServers": {
    "oxygen": {
      "url": "https://webhelp-mcp.vercel.app/www.oxygenxml.com/doc/versions/26.1/ug-editor"
    }
  }
}
Search query: “content completion”
Fetch ID: 0:topics/streamline-with-content-completion.html
Result: Full content completion documentation with examples

Error Handling

Invalid Document ID

{
  "error": "Unknown base URL index: 5"
}
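This error occurs when the index in the ID doesn’t correspond to any configured base URL. In the tool response produced by the handler shown earlier, the message arrives as the text Fetch failed: Unknown base URL index: 5 with isError set to true. Always use IDs from search results.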

Document Not Found

{
  "error": "Failed to download file: HTTP 404"
}
Common causes:
  • Document was moved or deleted
  • Search index is outdated
  • Incorrect document path
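A client that wants to react differently to these two failures can inspect the message text. The sketch below assumes the message wording shown on this page; classifyFetchError is a hypothetical helper, and matching on message strings is brittle if the server's wording changes.

```typescript
type FetchErrorKind = 'unknown-index' | 'not-found' | 'other';

// Hypothetical helper: maps an error message to a coarse category.
// The matched substrings are taken from the examples on this page.
function classifyFetchError(message: string): FetchErrorKind {
  if (message.includes('Unknown base URL index')) return 'unknown-index';
  if (/HTTP 404\b/.test(message)) return 'not-found';
  return 'other';
}

console.log(classifyFetchError('Fetch failed: Unknown base URL index: 5'));
console.log(classifyFetchError('Failed to download file: HTTP 404'));
```

For a not-found result, rerunning the search tool to obtain a fresh ID is usually a better recovery than retrying the same fetch.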

Conversion Failures

If Turndown encounters problematic HTML, it may produce malformed Markdown. The server doesn’t validate output quality.
If Markdown output looks broken, try viewing the original HTML at the returned URL.

Performance Considerations

Download Time

Each fetch downloads the complete HTML page from the WebHelp site:
let htmlContent = await downloadFile(fullUrl);
  • Typical page size: 10-100 KB
  • Download time: 100-500ms depending on network and server
  • No caching: each fetch downloads fresh content
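Since the server does not cache, a client that fetches the same document repeatedly may want its own short-lived cache. The following is one possible sketch, not something the server provides. TtlCache and withCache are hypothetical names, and the fetcher passed to withCache stands in for whatever function the client uses to call the fetch tool.

```typescript
// Hypothetical in-memory cache with a time-to-live, keyed by document ID.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, which makes expiry easy to test
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Wrap any async fetcher so repeated IDs within the TTL skip the download.
function withCache<V>(
  fetcher: (id: string) => Promise<V>,
  cache: TtlCache<V>,
): (id: string) => Promise<V> {
  return async (id) => {
    const hit = cache.get(id);
    if (hit !== undefined) return hit;
    const value = await fetcher(id);
    cache.set(id, value);
    return value;
  };
}
```

A short TTL keeps the trade-off visible: cached content can lag behind the live WebHelp site for at most that long.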

Conversion Time

HTML to Markdown conversion is fast but depends on document size:
  • Small documents (< 50 KB): < 50ms
  • Medium documents (50-200 KB): 50-200ms
  • Large documents (> 200 KB): 200ms-1s
The server finishes both the download and the conversion before it responds, so large documents may cause timeouts in some AI tools.

Best Practices

Search First

Always search before fetching to find the right document IDs

Use Exact IDs

Never construct IDs manually — always use search results

Fetch Selectively

Only fetch documents you need — searches are much faster

Check URLs

Include the original URL in citations for user reference
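As a small illustration of the last practice, a client could turn the title and url fields of a fetch result into a Markdown citation. formatCitation is a hypothetical helper, and the Markdown link format is only one reasonable choice.

```typescript
interface CitableDocument {
  title: string;
  url: string;
}

// Hypothetical helper: renders a fetch result as a Markdown link.
function formatCitation(doc: CitableDocument): string {
  // Escape square brackets so the title cannot break the link syntax.
  const safeTitle = doc.title.replace(/[\[\]]/g, '\\$&');
  return `[${safeTitle}](${doc.url})`;
}

console.log(formatCitation({
  title: 'Getting Started with WebHelp',
  url: 'https://example.com/docs/topics/getting-started.html',
}));
```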

Markdown Quality

The conversion quality depends on the WebHelp HTML structure.
Well-Converted Elements:
  • Headings and paragraphs
  • Code blocks (the language tag is kept when the HTML provides one)
  • Lists (tables become Markdown tables only when Turndown's GFM plugin is enabled)
  • Links and images
  • Bold and italic text
Potentially Problematic:
  • Custom WebHelp widgets
  • JavaScript-rendered content
  • Complex CSS layouts
  • Embedded multimedia
Most Oxygen WebHelp content converts cleanly because it uses semantic HTML generated from DITA.

Next Steps

Search Tool

Learn how to find documents to fetch

Federated Search

Fetch from multiple sites

Integration Guide

Set up with Claude Desktop

Deploy Your Own

Host a private instance
