
Semantic Search

Semantic search provides AI-powered natural language understanding for WebHelp documentation queries. When available, it offers more intuitive search results than traditional keyword matching.

How It Works

Semantic search is powered by Oxygen XML’s Feedback service. For each query, the server:
  1. Extracts the deployment token from the WebHelp site
  2. Sends the query to the Oxygen Feedback API
  3. Receives AI-ranked results based on semantic relevance
  4. Falls back to index-based search if semantic search is unavailable or returns no results
Semantic search only works for WebHelp sites that have Oxygen Feedback enabled. The server automatically detects availability and falls back gracefully.

Implementation

Here’s the complete semantic search implementation:
// From webhelp-search-client.ts:95-171
async semanticSearch(
  query: string,
  baseUrl: string,
  pageSize: number = 10
): Promise<SearchResult> {
  try {
    // Extract deployment token from the WebHelp page
    const mainPage = await downloadFile(baseUrl);
    const match = mainPage.match(/feedback-init[^>]+deploymentToken=([^"'>]+)/);
    if (!match) {
      return { error: 'Deployment token not found', results: [] };
    }
    const token = match[1];

    // Prepare search request
    const postData = JSON.stringify({
      searchQuery: query,
      facets: [],
      currentPage: 1,
      pageSize,
      exactSearch: false,
      defaultJoinOperator: 'AND',
      highlight: false,
      indexFields: []
    });

    // Configure proxy if needed
    const proxyUrl =
      process.env.HTTPS_PROXY ||
      process.env.https_proxy ||
      process.env.HTTP_PROXY ||
      process.env.http_proxy;

    const options: any = {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Content-Length': Buffer.byteLength(postData)
      }
    };
    if (proxyUrl) {
      options.agent = new HttpsProxyAgent(proxyUrl);
    }

    // Execute search request
    const dataStr: string = await new Promise((resolve, reject) => {
      const req = https.request(
        `https://feedback.oxygenxml.com/api/html-content/search?token=${token}`,
        options,
        res => {
          if (res.statusCode && res.statusCode >= 200 && res.statusCode < 300) {
            let body = '';
            res.on('data', chunk => (body += chunk));
            res.on('end', () => resolve(body));
          } else {
            reject(new Error(`HTTP ${res.statusCode}: ${res.statusMessage}`));
          }
        }
      );
      req.on('error', reject);
      req.write(postData);
      req.end();
    });

    // Parse and format results
    const data: any = JSON.parse(dataStr);
    const results = (data.documents || []).map((doc: any, idx: number) => {
      const url = doc.fields?.uri || '';
      const rel = url.startsWith(baseUrl) ? url.substring(baseUrl.length) : url;
      return {
        id: `0:${rel}`,
        title: doc.fields?.title || '',
        url,
        score: doc.score ?? 0
      };
    });

    return { results };
  } catch (error: any) {
    return { error: `Semantic search failed: ${error.message}`, results: [] };
  }
}

Token Extraction

The deployment token is embedded in the WebHelp page HTML:
const mainPage = await downloadFile(baseUrl);
const match = mainPage.match(/feedback-init[^>]+deploymentToken=([^"'>]+)/);
if (!match) {
  return { error: 'Deployment token not found', results: [] };
}
const token = match[1];
Typical HTML:
<script>
  oxygenFeedbackInit({
    deploymentToken: 'abc123def456',
    productName: 'My Documentation',
    productVersion: '1.0'
  });
</script>
The pattern requires the text feedback-init followed, within the same tag, by deploymentToken=<value>. If the page HTML contains no such match, semantic search is unavailable for that site.
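To see what the pattern accepts, here is a minimal sketch that runs the same regular expression against two sample strings. The `extractDeploymentToken` helper and both HTML samples are hypothetical and written only for illustration; real WebHelp pages may embed the token differently.

```typescript
// Same pattern as in semanticSearch(); the helper name is hypothetical.
const TOKEN_PATTERN = /feedback-init[^>]+deploymentToken=([^"'>]+)/;

function extractDeploymentToken(html: string): string | null {
  const match = html.match(TOKEN_PATTERN);
  // Capture group 1 holds everything after "deploymentToken=" up to a quote or ">".
  return match?.[1] ?? null;
}

// Assumed sample: the token is passed as a query parameter on a script URL.
const scriptTag =
  '<script src="https://example.com/feedback-init.js?deploymentToken=abc123def456"></script>';

// Assumed sample: a page with no Feedback integration at all.
const plainPage = '<html><body><h1>My Documentation</h1></body></html>';

console.log(extractDeploymentToken(scriptTag)); // "abc123def456"
console.log(extractDeploymentToken(plainPage)); // null
```

A `null` result here corresponds to the 'Deployment token not found' error returned by `semanticSearch`.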

API Request

The server sends a POST request to the Oxygen Feedback API.
Endpoint:
https://feedback.oxygenxml.com/api/html-content/search?token={token}
Request body:
{
  "searchQuery": "how do I validate DITA maps",
  "facets": [],
  "currentPage": 1,
  "pageSize": 10,
  "exactSearch": false,
  "defaultJoinOperator": "AND",
  "highlight": false,
  "indexFields": []
}
Parameters:
  • searchQuery — Natural language query
  • pageSize — Maximum results (default: 10)
  • exactSearch — Exact phrase matching (always false)
  • defaultJoinOperator — Term matching mode (always “AND”)
  • highlight — Return highlighted snippets (always false)
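The request body can be captured as a small typed builder. This is a sketch only: the `FeedbackSearchRequest` interface and `buildSearchRequest` helper are hypothetical names, and the element types of `facets` and `indexFields` are left as `unknown` because the server always sends them empty.

```typescript
// Shape of the JSON body shown above; the interface name is hypothetical.
interface FeedbackSearchRequest {
  searchQuery: string;
  facets: unknown[];
  currentPage: number;
  pageSize: number;
  exactSearch: boolean;
  defaultJoinOperator: 'AND';
  highlight: boolean;
  indexFields: unknown[];
}

// Hypothetical helper: only the query and the page size vary between requests.
function buildSearchRequest(query: string, pageSize: number = 10): FeedbackSearchRequest {
  return {
    searchQuery: query,
    facets: [],
    currentPage: 1,
    pageSize,
    exactSearch: false,
    defaultJoinOperator: 'AND',
    highlight: false,
    indexFields: [],
  };
}

// Serialize exactly once, so the Content-Length header can be computed from the same string.
const requestBody = JSON.stringify(buildSearchRequest('how do I validate DITA maps'));
console.log(requestBody);
```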

Response Format

The Oxygen Feedback API returns:
{
  "documents": [
    {
      "fields": {
        "uri": "https://example.com/docs/topics/validation.html",
        "title": "Validating DITA Maps"
      },
      "score": 0.95
    },
    {
      "fields": {
        "uri": "https://example.com/docs/topics/checking-links.html",
        "title": "Checking Links in DITA"
      },
      "score": 0.82
    }
  ]
}
Scores range from 0 to 1, with higher values indicating better semantic relevance.
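As a worked example, the sketch below applies the same mapping that `semanticSearch` uses to the response above. The interface names and the `toResults` helper are hypothetical, and the base URL `https://example.com/docs/` (with a trailing slash) is an assumption chosen to match the sample URIs.

```typescript
// Hypothetical types describing only the fields the server reads.
interface FeedbackDocument {
  fields?: { uri?: string; title?: string };
  score?: number;
}
interface FeedbackResponse {
  documents?: FeedbackDocument[];
}
interface ResultEntry {
  id: string;
  title: string;
  url: string;
  score: number;
}

function toResults(data: FeedbackResponse, baseUrl: string): ResultEntry[] {
  return (data.documents ?? []).map((doc) => {
    const url = doc.fields?.uri ?? '';
    // Strip the base URL so the id carries a site-relative path.
    const rel = url.startsWith(baseUrl) ? url.substring(baseUrl.length) : url;
    return { id: `0:${rel}`, title: doc.fields?.title ?? '', url, score: doc.score ?? 0 };
  });
}

const sampleResponse: FeedbackResponse = {
  documents: [
    {
      fields: { uri: 'https://example.com/docs/topics/validation.html', title: 'Validating DITA Maps' },
      score: 0.95,
    },
    {
      fields: { uri: 'https://example.com/docs/topics/checking-links.html', title: 'Checking Links in DITA' },
      score: 0.82,
    },
  ],
};

const mapped = toResults(sampleResponse, 'https://example.com/docs/');
console.log(mapped.map((r) => r.id));
// [ '0:topics/validation.html', '0:topics/checking-links.html' ]
```

Documents without a `uri`, `title`, or `score` still produce an entry, with empty strings and a score of 0.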

Search Fallback Strategy

The WebHelp MCP Server implements automatic fallback:
// From webhelp-search-client.ts:40-52
async search(query: string): Promise<SearchResult> {
  const urls = this.baseUrls;

  // Try semantic search for single sites
  if (urls.length === 1) {
    try {
      const semantic = await this.semanticSearch(query, urls[0]);
      if (!semantic.error && semantic.results.length > 0) {
        return semantic;
      }
    } catch (e) {
      // Fall back to index search
    }
  }

  // Use index search for federated or fallback
  // ...
}
Fallback conditions:
  1. Federated search (multiple sites) — Always uses index search
  2. No deployment token found — Falls back to index search
  3. Oxygen Feedback API error — Falls back to index search
  4. No results from semantic search — Falls back to index search
The fallback is transparent to the user. AI tools receive results without knowing which search method was used.
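The four conditions can be expressed as one decision function. The following is an illustrative sketch, not code from the server: `fallbackReason` and its labels are hypothetical, and a missing token and an API error are treated alike because both surface as an `error` string on the result.

```typescript
interface SearchResult {
  error?: string;
  results: { id: string; title: string; url: string; score: number }[];
}

type FallbackReason = 'federated' | 'semantic-error' | 'no-results';

// Returns why index search would be used, or null when the
// semantic result would be returned as-is.
function fallbackReason(siteCount: number, semantic?: SearchResult): FallbackReason | null {
  if (siteCount !== 1) return 'federated'; // condition 1
  if (!semantic || semantic.error) return 'semantic-error'; // conditions 2 and 3
  if (semantic.results.length === 0) return 'no-results'; // condition 4
  return null;
}

const hit = { id: '0:topics/validation.html', title: 'Validating DITA Maps', url: '', score: 0.95 };

console.log(fallbackReason(2)); // "federated"
console.log(fallbackReason(1, { error: 'Deployment token not found', results: [] })); // "semantic-error"
console.log(fallbackReason(1, { results: [] })); // "no-results"
console.log(fallbackReason(1, { results: [hit] })); // null
```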

Advantages

Natural Language

Understands questions like “How do I configure output?”

Better Ranking

AI-powered relevance scoring improves result quality

Context Aware

Understands synonyms and related concepts

User Friendly

No need to learn boolean operators or exact keywords

Limitations

Single Site Only

Semantic search works only for single-site queries:
if (urls.length === 1) {
  // Try semantic search
}
Federated searches always use index-based search, because semantic scores from different Oxygen Feedback instances aren’t comparable.

Requires Oxygen Feedback

Semantic search only works if:
  1. The WebHelp site was published with Oxygen Feedback enabled
  2. The deployment token is accessible in the page HTML
  3. The Feedback service is reachable from your server
Many WebHelp sites don’t have Oxygen Feedback enabled. The server falls back gracefully, but semantic search won’t be available.

API Dependency

Semantic search depends on an external service:
https://feedback.oxygenxml.com/api/html-content/search
If this service is down or unreachable, semantic search fails and falls back to index search.

No Customization

The server always uses these search parameters:
  • exactSearch: false
  • defaultJoinOperator: 'AND'
  • highlight: false
  • pageSize: 10
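These cannot be customized per query. Although semanticSearch() accepts a pageSize argument, search() calls it without one, so the default of 10 always applies.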

Proxy Support

The server respects HTTP proxy environment variables:
const proxyUrl =
  process.env.HTTPS_PROXY ||
  process.env.https_proxy ||
  process.env.HTTP_PROXY ||
  process.env.http_proxy;

if (proxyUrl) {
  options.agent = new HttpsProxyAgent(proxyUrl);
}
Supported variables, in order of precedence (the first non-empty value wins):
  • HTTPS_PROXY
  • https_proxy
  • HTTP_PROXY
  • http_proxy
Set these environment variables if your server requires a proxy to reach feedback.oxygenxml.com.
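The precedence can be checked in isolation with a small sketch. The `pickProxyUrl` helper is hypothetical; it takes the environment as a parameter so it can be exercised without touching the real process environment, and in the server's context you would pass `process.env`.

```typescript
type Env = Record<string, string | undefined>;

// Mirrors the || chain above: the first non-empty variable wins.
function pickProxyUrl(env: Env): string | undefined {
  return (
    env['HTTPS_PROXY'] ||
    env['https_proxy'] ||
    env['HTTP_PROXY'] ||
    env['http_proxy'] ||
    undefined // normalize an all-empty environment to undefined
  );
}

// The lowercase https_proxy outranks the uppercase HTTP_PROXY.
console.log(pickProxyUrl({ HTTP_PROXY: 'http://a.example:3128', https_proxy: 'http://b.example:3128' }));
// "http://b.example:3128"

// An empty string counts as unset, so the next variable is used.
console.log(pickProxyUrl({ HTTPS_PROXY: '', http_proxy: 'http://c.example:3128' }));
// "http://c.example:3128"
```

Note that an `HTTP_PROXY` value is used for the HTTPS request to the Feedback API when no `HTTPS_PROXY` variant is set.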

Error Handling

Token Not Found

{
  "error": "Deployment token not found",
  "results": []
}
The WebHelp page doesn’t include Oxygen Feedback integration.

API Request Failed

{
  "error": "Semantic search failed: HTTP 503: Service Unavailable",
  "results": []
}
The Oxygen Feedback service is temporarily unavailable.

Network Errors

{
  "error": "Semantic search failed: ECONNREFUSED",
  "results": []
}
The server cannot reach feedback.oxygenxml.com (check proxy settings).
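A caller that wants to tell these cases apart could inspect the error string. This is a sketch under the assumption that the messages keep the exact formats shown above; `classifySemanticError` and its labels are hypothetical and not part of the server.

```typescript
type SemanticFailure = 'no-token' | 'http-status' | 'network-or-other';

function classifySemanticError(error: string): SemanticFailure {
  // Returned directly by semanticSearch() when the regex finds no token.
  if (error === 'Deployment token not found') return 'no-token';
  // Produced when the Feedback API answers with a non-2xx status.
  if (/^Semantic search failed: HTTP \d{3}/.test(error)) return 'http-status';
  // Everything else: connection failures, JSON parse errors, and so on.
  return 'network-or-other';
}

console.log(classifySemanticError('Deployment token not found')); // "no-token"
console.log(classifySemanticError('Semantic search failed: HTTP 503: Service Unavailable')); // "http-status"
console.log(classifySemanticError('Semantic search failed: ECONNREFUSED')); // "network-or-other"
```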

Performance

Request Time

Semantic search involves two HTTP requests:
  1. Download main page to extract token (~200ms)
  2. POST query to Feedback API (~300-800ms)
Total: 500-1000ms

Caching Opportunities

The deployment token could be cached to eliminate the first request:
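// Potential optimization (not implemented)
const tokenCache = new Map<string, string>();

async function getToken(baseUrl: string): Promise<string | undefined> {
  const cached = tokenCache.get(baseUrl);
  if (cached !== undefined) return cached;
  // Cache miss: download the page once and remember the token
  const mainPage = await downloadFile(baseUrl);
  const match = mainPage.match(/feedback-init[^>]+deploymentToken=([^"'>]+)/);
  if (match) tokenCache.set(baseUrl, match[1]);
  return match?.[1];
}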
Token caching is not currently implemented. Each semantic search downloads the main page.

Query Tips

Good semantic queries:
  • “How do I configure PDF output?”
  • “What is the difference between a map and a topic?”
  • “Troubleshooting build errors”
  • “Best practices for reuse”
Less effective queries:
  • Single words: “PDF”, “map”, “error”
  • Boolean operators: “publishing AND PDF” (use index search instead)
  • Exact phrases in quotes (the server always sends exactSearch: false, so use index search instead)
Semantic search works best with natural questions and multi-word phrases that express intent.

Checking Availability

To check if a WebHelp site supports semantic search:
  1. View the page source
  2. Search for “feedback-init” or “deploymentToken”
  3. If found, semantic search is available
Example:
curl https://www.oxygenxml.com/doc/versions/26.1/ug-editor/ | grep deploymentToken
If the output includes the token, semantic search works for that site.

| Feature | Semantic Search | Index Search |
|---|---|---|
| Natural language | ✅ Excellent | ❌ Limited |
| Boolean operators | ❌ Not supported | ✅ Supported |
| Federated search | ❌ Single site only | ✅ Multi-site |
| Performance | 🟡 500-1000ms | 🟢 200-500ms |
| Availability | 🟡 Oxygen Feedback only | ✅ All WebHelp sites |
| Result quality | 🟢 AI-ranked | 🟡 Keyword-based |
| Offline support | ❌ Requires API | ✅ Local index |

Next Steps

Search Tool

Learn about the unified search tool

Federated Search

Query multiple sites (uses index search)

Fetch Tool

Retrieve full document content

Oxygen Feedback

Learn about Oxygen Feedback integration
