The Hentai2Read handler is the fastest extractor in the project, using pure JSON parsing to extract image URLs without browser automation or AI.

Supported domains

The handler supports Hentai2Read domains:
@staticmethod
def get_supported_domains() -> list:
    return ["hentai2read"]
Location: ~/workspace/source/core/sites/h2r.py:17-19 This matches any domain containing "hentai2read" (e.g., hentai2read.com, hentai2read.org).
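The substring match described above can be sketched as follows. This is an illustration of the matching behavior, not the project's actual routing code; `matches_handler` is a hypothetical helper:

```python
from urllib.parse import urlparse

def matches_handler(url: str, supported_domains: list) -> bool:
    # Substring match against the hostname, mirroring the
    # "any domain containing 'hentai2read'" behavior described above.
    host = urlparse(url).netloc.lower()
    return any(domain in host for domain in supported_domains)
```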

Supported URLs

https://hentai2read.com/[manga-title]/[chapter-number]
https://hentai2read.com/[manga-title]
Example:
https://hentai2read.com/my_favorite_doujin/1
https://hentai2read.com/my_favorite_doujin/1/2  (page number ignored)

Extraction technology

JSON parsing

Hentai2Read embeds chapter metadata directly in the page HTML as a JavaScript variable:
var gData = {
    "images": ["/path/to/image1.jpg", "/path/to/image2.jpg", ...],
    "title": "Chapter Title",
    ...
};
The handler extracts and parses this data:
# Extract gData JSON variable
gdata_match = re.search(r'var gData\s*=\s*(\{.*?\});', html, re.DOTALL)

if gdata_match:
    json_str = gdata_match.group(1)
    
    # Extract images array
    images_match = re.search(
        r'[\'"]images[\'"]\s*:\s*\[(.*?)\]',
        json_str,
        re.DOTALL
    )
    
    if images_match:
        img_list_raw = images_match.group(1)
        raw_paths = re.findall(r'["\']([^"\']+)["\']', img_list_raw)
        paths = [p.replace('\\/', '/') for p in raw_paths]
Location: ~/workspace/source/core/sites/h2r.py:40-49 This approach:
  • Requires no JavaScript execution
  • No browser automation needed
  • No AI processing required
  • Extremely fast and reliable
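Because the embedded blob is (usually) valid JSON, an alternative to field-by-field regexes is parsing the whole object at once. This is a sketch, not the handler's actual code; `parse_gdata` is hypothetical and assumes the blob is strictly valid JSON:

```python
import json
import re

def parse_gdata(html: str) -> dict:
    """Parse the embedded gData object in one shot.

    Assumes the blob is strictly valid JSON; the handler's
    field-by-field regexes are more tolerant of trailing
    JavaScript extras, so treat this as an alternative sketch.
    """
    m = re.search(r'var gData\s*=\s*(\{.*?\});', html, re.DOTALL)
    if not m:
        return {}
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return {}
```

A bonus of this approach: `json.loads` decodes `\/` escapes automatically, so no manual `replace('\\/', '/')` step is needed.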

CDN URL construction

Images are hosted on CDN servers. The handler constructs full URLs:
base_url = "https://static.hentai.direct/hentai"

# Try to find actual CDN from page
cdn_match = re.search(r'src=["\'](https://[^"/]+/hentai)/', html)
if cdn_match:
    base_url = cdn_match.group(1)

# Construct full URLs
image_urls = [
    f"{base_url}{p}" if not p.startswith("http") else p 
    for p in paths
]
Location: ~/workspace/source/core/sites/h2r.py:51-56 This ensures images are downloaded from the correct CDN server.

Title extraction

The title is also embedded in the gData variable:
pdf_name = "hentai2read_chapter.pdf"

title_match = re.search(
    r'[\'"]title[\'"]\s*:\s*[\'"](.*?)[\'"]',
    json_str
)

if title_match:
    safe = clean_filename(title_match.group(1).strip())
    pdf_name = f"{safe}.pdf"
Location: ~/workspace/source/core/sites/h2r.py:59-63 The clean_filename utility removes invalid characters for file systems.
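The project's actual `clean_filename` lives in its utils module; a minimal sketch of the behavior described above (the real implementation may differ):

```python
import re

def clean_filename(name: str) -> str:
    # Remove characters invalid on common file systems (Windows is the
    # strictest) and collapse runs of whitespace.
    name = re.sub(r'[<>:"/\\|?*]', "", name)
    return re.sub(r"\s+", " ", name).strip()
```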

Implementation details

Class structure

class H2RHandler(BaseSiteHandler):
    """Handler for Hentai2Read website."""
    
    @staticmethod
    def get_supported_domains() -> list:
        return ["hentai2read"]
    
    async def process(
        self,
        url: str,
        log_callback: Callable[[str], None],
        check_cancel: Callable[[], bool],
        progress_callback: Optional[Callable[[int, int], None]] = None
    ) -> None:
        """Process Hentai2Read URL."""
        ...
Location: ~/workspace/source/core/sites/h2r.py:14-28

Crawl4AI usage

Despite being JSON-based, H2R still uses Crawl4AI for initial page fetching:
async with AsyncWebCrawler(verbose=True) as crawler:
    result = await crawler.arun(url=url, bypass_cache=True)
    if not result.success:
        log_callback(f"[ERROR] Page load failed: {result.error_message}")
        return
    
    html = result.html
Location: ~/workspace/source/core/sites/h2r.py:31-37 This could be optimized to use aiohttp directly, but Crawl4AI provides:
  • Consistent interface with other handlers
  • Built-in caching
  • Error handling
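If one did swap in aiohttp, the fetch step might look like this. A sketch only, not the handler's code; `fetch_html` is a hypothetical helper, and it loses Crawl4AI's caching and unified error reporting:

```python
import aiohttp

async def fetch_html(url: str, headers: dict) -> str:
    # Plain aiohttp fetch -- the lighter-weight alternative
    # mentioned above. raise_for_status() surfaces HTTP errors.
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.text()
```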

Error handling

The handler includes comprehensive error handling:
if gdata_match:
    try:
        json_str = gdata_match.group(1)
        images_match = re.search(
            r'[\'"]images[\'"]\s*:\s*\[(.*?)\]', json_str, re.DOTALL
        )

        if images_match:
            # ... extraction logic ...
            await download_and_make_pdf(...)
        else:
            log_callback("[ERROR] Could not extract image list.")
    except Exception as e:
        log_callback(f"[ERROR] Error processing metadata: {e}")
else:
    log_callback("[ERROR] Chapter data not found.")
Location: ~/workspace/source/core/sites/h2r.py:43-79

Headers and configuration

H2R requires minimal headers:
HEADERS_H2R = {
    "User-Agent": "Mozilla/5.0 ...",
    "Referer": "https://hentai2read.com/"
}
Passed to download_and_make_pdf for image downloads.

Usage examples

Single chapter download

from core.handler import process_url

await process_url(
    "https://hentai2read.com/my_doujin/1",
    log_callback=print,
    check_cancel=lambda: False,
    progress_callback=lambda current, total: print(f"{current}/{total}")
)
Output: PDF/My Doujin - Chapter 1.pdf

Via web interface

  1. Start: START_WEB_VERSION.bat
  2. Navigate to: http://localhost:3000
  3. Paste URL: https://hentai2read.com/my_doujin/1
  4. Download completes in seconds

Via Discord bot

!descargar https://hentai2read.com/my_doujin/1
Bot responds with PDF attachment or GoFile link.

Performance characteristics

Speed comparison

Handler | Time for 50 images | Reason
--------|--------------------|----------------------
H2R     | ~5 seconds         | Pure JSON parsing
M440    | ~8 seconds         | Regex extraction
TMO-H   | ~15 seconds        | AI processing
Hitomi  | ~60 seconds        | Page-by-page browser
Times are estimates and depend on network speed and image sizes.

Resource usage

  • CPU: Minimal (no AI or JS execution)
  • Memory: Low (~50MB per chapter)
  • Network: Only downloads actual images, no extra requests
  • API calls: None required

Known limitations

Single chapter only

H2R handler does not support automatic series detection or bulk downloads. You must provide individual chapter URLs.

No lazy loading

Since extraction happens server-side (JSON parsing), there’s no need for lazy loading scripts or browser automation.

Escape sequences

The handler unescapes forward slashes in the JSON paths:
paths = [p.replace('\\/', '/') for p in raw_paths]
Location: ~/workspace/source/core/sites/h2r.py:49 This ensures URLs are properly formatted.

Troubleshooting

"Chapter data not found"

This error means the gData variable wasn’t found in the HTML. Possible causes:
  1. Site structure changed
  2. URL is invalid (404 page)
  3. Page requires authentication
  4. JavaScript variable name changed
Solution: Inspect the page source and update the regex pattern.
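A quick way to test the pattern against a saved copy of the page source (`find_gdata` is a hypothetical diagnostic helper, not part of the handler):

```python
import re

GDATA_RE = re.compile(r'var gData\s*=\s*(\{.*?\});', re.DOTALL)

def find_gdata(html: str):
    # Return the raw gData blob if present, else None -- useful for
    # checking whether the variable still exists in the page source.
    m = GDATA_RE.search(html)
    return m.group(1) if m else None

# e.g. run against a page saved from the browser:
# with open("page.html", encoding="utf-8") as f:
#     print(find_gdata(f.read()) is not None)
```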

"Could not extract image list"

The gData variable was found, but the images array is missing:
images_match = re.search(r'[\'"]images[\'"]\s*:\s*\[(.*?)\]', json_str, re.DOTALL)
if images_match:
    # ... success ...
else:
    log_callback("[ERROR] Could not extract image list.")
Location: ~/workspace/source/core/sites/h2r.py:45-75 Solution: Check if the JSON structure has changed.

CDN errors (404 on images)

If the base CDN URL is wrong, image downloads will fail:
cdn_match = re.search(r'src=["\'](https://[^"/]+/hentai)/', html)
if cdn_match:
    base_url = cdn_match.group(1)
Location: ~/workspace/source/core/sites/h2r.py:52-54 Solution: Update the fallback URL in the code:
base_url = "https://static.hentai.direct/hentai"  # Update if CDN changes

Advantages over other methods

vs. Browser automation (Hitomi, NHentai)

  • 10x faster
  • No Playwright dependencies
  • No headless browser overhead
  • More reliable (no Cloudflare issues)

vs. AI extraction (ZonaTMO, TMO-H)

  • No API key required
  • No LLM costs
  • No parsing errors
  • Deterministic results

vs. Regex-only (M440)

  • Cleaner extraction (structured JSON)
  • More maintainable
  • Less prone to breaking on HTML changes

Code walkthrough

Here’s the complete extraction flow:
  1. Fetch page HTML
result = await crawler.arun(url=url, bypass_cache=True)
html = result.html
  2. Extract gData variable
gdata_match = re.search(r'var gData\s*=\s*(\{.*?\});', html, re.DOTALL)
json_str = gdata_match.group(1)
  3. Parse images array
images_match = re.search(r'[\'"]images[\'"]\s*:\s*\[(.*?)\]', json_str, re.DOTALL)
img_list_raw = images_match.group(1)
raw_paths = re.findall(r'["\']([^"\']+)["\']', img_list_raw)
  4. Construct URLs
image_urls = [f"{base_url}{p}" if not p.startswith("http") else p for p in paths]
  5. Generate PDF
await download_and_make_pdf(
    image_urls,
    pdf_name,
    config.HEADERS_H2R,
    log_callback,
    check_cancel,
    progress_callback,
    open_result=config.OPEN_RESULT_ON_FINISH
)
Location: Full flow in ~/workspace/source/core/sites/h2r.py:31-73

Next steps

M440

Compare with regex-based extraction

Hitomi.la

See why some sites need browser automation

Utils

Explore download_and_make_pdf

Configuration

Configure headers and output paths
