GET /api/fetchHtml
Fetch HTML
curl --request GET \
  --url 'https://api.example.com/api/fetchHtml?url=https%3A%2F%2Fexample.com%2Frecipe%2Fchocolate-chip-cookies'
{
  "success": true,
  "html": "<string>",
  "error": {
    "code": "<string>",
    "message": "<string>"
  }
}

Overview

The Fetch HTML endpoint retrieves the HTML at a given URL and returns a cleaned version of it. Scripts, styles, navigation, ads, and other non-content elements are stripped out, leaving only the main body content.

Request

url
string
required
The URL to fetch HTML content from, passed as the url query parameter. Must be a valid, well-formed URL.

Example Request

With the target URL passed as-is:
curl -X GET "https://yourdomain.com/api/fetchHtml?url=https://example.com/recipe/chocolate-chip-cookies"
With the target URL percent-encoded:
curl -X GET "https://yourdomain.com/api/fetchHtml?url=https%3A%2F%2Fexample.com%2Frecipe%2Fchocolate-chip-cookies"
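
Percent-encoding the target URL keeps any query string in the target from being read as parameters of the endpoint itself. A minimal TypeScript sketch of building such a request URL is shown here. The helper name `buildFetchHtmlUrl` is hypothetical, and the `https://yourdomain.com` base is the placeholder from the example above, not a real host.

```typescript
// Hypothetical helper (not part of the API): builds a /api/fetchHtml
// request URL for a given target page.
// encodeURIComponent escapes ":", "/", "?", "&" and "=", so the whole
// target URL travels as a single `url` query parameter.
function buildFetchHtmlUrl(baseUrl: string, targetUrl: string): string {
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return base + "/api/fetchHtml?url=" + encodeURIComponent(targetUrl);
}

const requestUrl = buildFetchHtmlUrl(
  "https://yourdomain.com",
  "https://example.com/recipe/chocolate-chip-cookies"
);
console.log(requestUrl);
```

For the recipe URL above, this produces the same string as the percent-encoded curl example.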

Response

Success Response

success
boolean
Indicates whether the operation was successful.
html
string
The cleaned HTML content from the page body, with scripts, styles, and other non-content elements removed.
{
  "success": true,
  "html": "<div class=\"recipe\">...</div>"
}

Error Response

success
boolean
Always false for error responses.
error
object
Error details object.
code
string
Error code identifier (e.g., ERR_INVALID_URL, ERR_TIMEOUT, ERR_FETCH_FAILED).
message
string
Human-readable error message.
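
The two response shapes described above can be modeled as a discriminated union on `success`. The following is a sketch only: the type names and the `parseFetchHtmlResponse` helper are illustrative assumptions, not something the API ships.

```typescript
// Response shapes as documented above. All names here are illustrative.
interface FetchHtmlSuccess {
  success: true;
  html: string;
}

interface FetchHtmlError {
  success: false;
  error: { code: string; message: string };
}

type FetchHtmlResponse = FetchHtmlSuccess | FetchHtmlError;

// Parses a raw JSON body and checks it against the documented fields.
// Throws if the body matches neither shape.
function parseFetchHtmlResponse(raw: string): FetchHtmlResponse {
  const body = JSON.parse(raw);
  if (body && body.success === true && typeof body.html === "string") {
    return { success: true, html: body.html };
  }
  if (
    body &&
    body.success === false &&
    body.error &&
    typeof body.error.code === "string" &&
    typeof body.error.message === "string"
  ) {
    return {
      success: false,
      error: { code: body.error.code, message: body.error.message },
    };
  }
  throw new Error("Response does not match the documented shape");
}

const parsed = parseFetchHtmlResponse(
  '{"success":true,"html":"<div class=\\"recipe\\">...</div>"}'
);
if (parsed.success === true) {
  console.log("cleaned HTML:", parsed.html);
} else {
  console.log("error:", parsed.error.code, parsed.error.message);
}
```

Checking `success` first lets the compiler narrow the type, so `html` is only read on success and `error` only on failure.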

HTML Cleaning Process

The endpoint performs aggressive cleaning. The tags and class selectors listed below are removed from the page before the result is returned.

Removed Elements

  • Scripts and styles: <script>, <style>, <noscript>, <link>, <meta>
  • Media elements: <svg>, <symbol>, <img>, <iframe>, <video>, <audio>, <canvas>
  • Interactive elements: <button>, <form>, <input>, <select>, <option>
  • Navigation: .navbar, .header, .footer, .sidebar, .breadcrumb, .nav
  • Advertisements: .ad, .ads, .sponsor, .promo, .adsbygoogle, .outbrain, .taboola
  • Social and sharing: .social, .share, .yummly-share
  • Ratings and comments: .rating, .comments, .comment, .rmp-rating-widget
  • App banners: .mobile-banner, .app-banner, .push-modal
  • WordPress blocks: .wp-block-*, .widget
  • Utilities: .print-btn, .scroll-to-top, .search-box, .tooltips
The endpoint preserves only the visible body content after removing these elements.
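
As a rough illustration of the removal list above, the tags and classes can be combined into one CSS selector. This is a sketch under stated assumptions, not the endpoint's actual code: the constant and function names are hypothetical, and `.wp-block-*` (which is not valid CSS) is assumed to mean "any class attribute containing `wp-block-`".

```typescript
// Tag names from the "Removed Elements" list.
const REMOVED_TAGS: string[] = [
  "script", "style", "noscript", "link", "meta",
  "svg", "symbol", "img", "iframe", "video", "audio", "canvas",
  "button", "form", "input", "select", "option",
];

// Class names from the same list, without the leading dot.
const REMOVED_CLASSES: string[] = [
  "navbar", "header", "footer", "sidebar", "breadcrumb", "nav",
  "ad", "ads", "sponsor", "promo", "adsbygoogle", "outbrain", "taboola",
  "social", "share", "yummly-share",
  "rating", "comments", "comment", "rmp-rating-widget",
  "mobile-banner", "app-banner", "push-modal",
  "widget",
  "print-btn", "scroll-to-top", "search-box", "tooltips",
];

// Assumption: ".wp-block-*" is expressed as an attribute substring match.
const REMOVED_CLASS_PREFIXES: string[] = ["wp-block-"];

// Joins everything into a single comma-separated selector string.
function buildRemovalSelector(): string {
  const classSelectors = REMOVED_CLASSES.map((name) => "." + name);
  const prefixSelectors = REMOVED_CLASS_PREFIXES.map(
    (prefix) => '[class*="' + prefix + '"]'
  );
  return REMOVED_TAGS.concat(classSelectors, prefixSelectors).join(", ");
}

// With cheerio (not imported here), the selector could be applied roughly as:
//   const $ = cheerio.load(rawHtml);
//   $(buildRemovalSelector()).remove();
//   const cleaned = $("body").html();
console.log(buildRemovalSelector());
```

A single combined selector keeps the removal to one pass over the document. Whether the real implementation uses one pass or several is not stated in this page.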

Error Handling

Error Code            Status  Description
ERR_INVALID_URL       200     URL parameter is missing or has an invalid format
ERR_NO_RECIPE_FOUND   200     Page not found (404), empty content, or no content after cleaning
ERR_FETCH_FAILED      200     Network error or server error (5xx)
ERR_TIMEOUT           200     Request timed out after 10 seconds
ERR_UNKNOWN           200     Unexpected error occurred
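
Every error code in the table is listed with status 200, which suggests that failures are reported in the response body rather than through the HTTP status. A client would then need to check `success` instead of the status code. The sketch below shows one way to do that. The `htmlOrThrow` helper and the fallback to `ERR_UNKNOWN` for unrecognized codes are assumptions made for illustration.

```typescript
// Error codes taken from the table above.
const KNOWN_ERROR_CODES: string[] = [
  "ERR_INVALID_URL",
  "ERR_NO_RECIPE_FOUND",
  "ERR_FETCH_FAILED",
  "ERR_TIMEOUT",
  "ERR_UNKNOWN",
];

// Illustrative body type matching the documented response fields.
type ApiBody =
  | { success: true; html: string }
  | { success: false; error: { code: string; message: string } };

// Hypothetical helper: returns the cleaned HTML, or throws an Error whose
// message starts with the error code. A 200 status alone says nothing
// about success, so only the body is inspected.
function htmlOrThrow(body: ApiBody): string {
  if (body.success === true) {
    return body.html;
  }
  // Assumption: codes not in the table are treated as ERR_UNKNOWN.
  const isKnown = KNOWN_ERROR_CODES.indexOf(body.error.code) !== -1;
  const code = isKnown ? body.error.code : "ERR_UNKNOWN";
  throw new Error(code + ": " + body.error.message);
}

try {
  htmlOrThrow({
    success: false,
    error: { code: "ERR_TIMEOUT", message: "Request timed out" },
  });
} catch (err) {
  console.log((err as Error).message);
}
```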

Error Response Examples

No URL provided:
{
  "success": false,
  "error": {
    "code": "ERR_INVALID_URL",
    "message": "No URL provided"
  }
}
Invalid URL format:
{
  "success": false,
  "error": {
    "code": "ERR_INVALID_URL",
    "message": "Invalid URL format"
  }
}
Request timeout:
{
  "success": false,
  "error": {
    "code": "ERR_TIMEOUT",
    "message": "Request timed out"
  }
}
Network error:
{
  "success": false,
  "error": {
    "code": "ERR_FETCH_FAILED",
    "message": "Network error occurred"
  }
}

Configuration

  • Timeout: 10 seconds (using AbortController)
  • User-Agent: Simulates Chrome browser to avoid bot detection
  • Accepted content: HTML, XHTML, XML
  • Accepted languages: English (en-US, en)
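
The settings above could be wired together roughly as follows. This is a sketch, not the endpoint's source: the exact header values are assumptions (this page only says the User-Agent resembles Chrome and that HTML, XHTML, XML, and English are accepted), and `fetchWithTimeout`, `buildRequestHeaders`, and the `fetchImpl` parameter are hypothetical names. How the server turns an aborted request into `ERR_TIMEOUT` is not shown here.

```typescript
const TIMEOUT_MS = 10000; // 10 seconds, as listed above

// Assumed header values; only their general intent comes from this page.
function buildRequestHeaders(): Record<string, string> {
  return {
    "User-Agent":
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9",
    "Accept-Language": "en-US,en;q=0.9",
  };
}

// Fetches a page as text and aborts the request once timeoutMs has elapsed.
// fetchImpl exists only so the function can be exercised without a network.
async function fetchWithTimeout(
  url: string,
  timeoutMs: number = TIMEOUT_MS,
  fetchImpl: typeof fetch = fetch
): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(url, {
      headers: buildRequestHeaders(),
      signal: controller.signal,
    });
    // Reading the body is still covered by the timeout.
    return await response.text();
  } finally {
    clearTimeout(timer); // always release the timer
  }
}
```

When the timer fires, the pending promise rejects, and the caller can map that rejection to a timeout error. Clearing the timer in `finally` avoids a stray abort after a fast response.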

Implementation Notes

  • Uses native fetch API with AbortController for timeout handling
  • Uses cheerio for HTML parsing and manipulation
  • Returns cleaned HTML as a string (not parsed DOM)
  • Fetches pages with browser-like headers to avoid bot detection
  • Uses the same 10-second timeout as /api/urlValidator
  • Returns consistent error structure using the formatError utility

Use Cases

  • Pre-processing HTML before sending to scraping services
  • Reducing payload size by removing unnecessary elements
  • Extracting main content from recipe pages
  • Preparing HTML for AI/ML parsing or analysis
