The Hentai2Read handler is the fastest extractor in the project, using pure JSON parsing to extract image URLs without browser automation or AI.

Supported domains

The handler supports Hentai2Read domains:
@staticmethod
def get_supported_domains() -> list:
    return ["hentai2read"]
Location: ~/workspace/source/core/sites/h2r.py:17-19 This matches any domain containing "hentai2read" (e.g., hentai2read.com, hentai2read.org).
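The substring match described above can be sketched as follows. This is an illustration of the matching behavior, not the project's actual routing code; `matches_handler` is a hypothetical helper:

```python
from urllib.parse import urlparse

def matches_handler(url: str, supported_domains: list) -> bool:
    # Substring match against the hostname, mirroring the
    # "any domain containing 'hentai2read'" behavior described above.
    host = urlparse(url).netloc.lower()
    return any(domain in host for domain in supported_domains)
```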

Supported URLs

https://hentai2read.com/[manga-title]/[chapter-number]
https://hentai2read.com/[manga-title]
Example:
https://hentai2read.com/my_favorite_doujin/1
https://hentai2read.com/my_favorite_doujin/1/2  (page number ignored)

Extraction technology

JSON parsing

Hentai2Read embeds chapter metadata directly in the page HTML as a JavaScript variable:
var gData = {
    "images": ["/path/to/image1.jpg", "/path/to/image2.jpg", ...],
    "title": "Chapter Title",
    ...
};
The handler extracts and parses this data:
# Extract gData JSON variable
gdata_match = re.search(r'var gData\s*=\s*(\{.*?\});', html, re.DOTALL)

if gdata_match:
    json_str = gdata_match.group(1)
    
    # Extract images array
    images_match = re.search(
        r'[\'"]images[\'"]\s*:\s*\[(.*?)\]',
        json_str,
        re.DOTALL
    )
    
    if images_match:
        img_list_raw = images_match.group(1)
        raw_paths = re.findall(r'["\']([^"\']+)["\']', img_list_raw)
        paths = [p.replace('\\/', '/') for p in raw_paths]
Location: ~/workspace/source/core/sites/h2r.py:40-49 This approach:
  • Requires no JavaScript execution
  • No browser automation needed
  • No AI processing required
  • Extremely fast and reliable
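Because the embedded blob is (usually) valid JSON, an alternative to field-by-field regexes is parsing the whole object at once. This is a sketch, not the handler's actual code; `parse_gdata` is hypothetical and assumes the blob is strictly valid JSON:

```python
import json
import re

def parse_gdata(html: str) -> dict:
    """Parse the embedded gData object in one shot.

    Assumes the blob is strictly valid JSON; the handler's
    field-by-field regexes are more tolerant of trailing
    JavaScript extras, so treat this as an alternative sketch.
    """
    m = re.search(r'var gData\s*=\s*(\{.*?\});', html, re.DOTALL)
    if not m:
        return {}
    try:
        return json.loads(m.group(1))
    except json.JSONDecodeError:
        return {}
```

A bonus of this approach: `json.loads` decodes `\/` escapes automatically, so no manual `replace('\\/', '/')` step is needed.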

CDN URL construction

Images are hosted on CDN servers. The handler constructs full URLs:
base_url = "https://static.hentai.direct/hentai"

# Try to find actual CDN from page
cdn_match = re.search(r'src=["\'](https://[^"/]+/hentai)/', html)
if cdn_match:
    base_url = cdn_match.group(1)

# Construct full URLs
image_urls = [
    f"{base_url}{p}" if not p.startswith("http") else p 
    for p in paths
]
Location: ~/workspace/source/core/sites/h2r.py:51-56 This ensures images are downloaded from the correct CDN server.

Title extraction

The title is also embedded in the gData variable:
pdf_name = "hentai2read_chapter.pdf"

title_match = re.search(
    r'[\'"]title[\'"]\s*:\s*[\'"](.*?)[\'"]',
    json_str
)

if title_match:
    safe = clean_filename(title_match.group(1).strip())
    pdf_name = f"{safe}.pdf"
Location: ~/workspace/source/core/sites/h2r.py:59-63 The clean_filename utility removes invalid characters for file systems.
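The project's actual `clean_filename` lives in its utils module; a minimal sketch of the behavior described above (the real implementation may differ):

```python
import re

def clean_filename(name: str) -> str:
    # Remove characters invalid on common file systems (Windows is the
    # strictest) and collapse runs of whitespace.
    name = re.sub(r'[<>:"/\\|?*]', "", name)
    return re.sub(r"\s+", " ", name).strip()
```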

Implementation details

Class structure

class H2RHandler(BaseSiteHandler):
    """Handler for Hentai2Read website."""
    
    @staticmethod
    def get_supported_domains() -> list:
        return ["hentai2read"]
    
    async def process(
        self,
        url: str,
        log_callback: Callable[[str], None],
        check_cancel: Callable[[], bool],
        progress_callback: Optional[Callable[[int, int], None]] = None
    ) -> None:
        """Process Hentai2Read URL."""
        ...
Location: ~/workspace/source/core/sites/h2r.py:14-28

Crawl4AI usage

Despite being JSON-based, H2R still uses Crawl4AI for initial page fetching:
async with AsyncWebCrawler(verbose=True) as crawler:
    result = await crawler.arun(url=url, bypass_cache=True)
    if not result.success:
        log_callback(f"[ERROR] Page load failed: {result.error_message}")
        return
    
    html = result.html
Location: ~/workspace/source/core/sites/h2r.py:31-37 This could be optimized to use aiohttp directly, but Crawl4AI provides:
  • Consistent interface with other handlers
  • Built-in caching
  • Error handling
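If one did swap in aiohttp, the fetch step might look like this. A sketch only, not the handler's code; `fetch_html` is a hypothetical helper, and it loses Crawl4AI's caching and unified error reporting:

```python
import aiohttp

async def fetch_html(url: str, headers: dict) -> str:
    # Plain aiohttp fetch -- the lighter-weight alternative
    # mentioned above. raise_for_status() surfaces HTTP errors.
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.text()
```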

Error handling

The handler includes comprehensive error handling:
if gdata_match:
    try:
        json_str = gdata_match.group(1)
        images_match = re.search(
            r'[\'"]images[\'"]\s*:\s*\[(.*?)\]', json_str, re.DOTALL
        )

        if images_match:
            # ... extraction logic ...
            await download_and_make_pdf(...)
        else:
            log_callback("[ERROR] Could not extract image list.")
    except Exception as e:
        log_callback(f"[ERROR] Error processing metadata: {e}")
else:
    log_callback("[ERROR] Chapter data not found.")
Location: ~/workspace/source/core/sites/h2r.py:43-79

Headers and configuration

H2R requires minimal headers:
HEADERS_H2R = {
    "User-Agent": "Mozilla/5.0 ...",
    "Referer": "https://hentai2read.com/"
}
Passed to download_and_make_pdf for image downloads.

Usage examples

Single chapter download

from core.handler import process_url

await process_url(
    "https://hentai2read.com/my_doujin/1",
    log_callback=print,
    check_cancel=lambda: False,
    progress_callback=lambda current, total: print(f"{current}/{total}")
)
Output: PDF/My Doujin - Chapter 1.pdf

Via web interface

  1. Start: START_WEB_VERSION.bat
  2. Navigate to: http://localhost:3000
  3. Paste URL: https://hentai2read.com/my_doujin/1
  4. Download completes in seconds

Via Discord bot

!descargar https://hentai2read.com/my_doujin/1
Bot responds with PDF attachment or GoFile link.

Performance characteristics

Speed comparison

Handler | Time for 50 images | Reason
--------|--------------------|----------------------
H2R     | ~5 seconds         | Pure JSON parsing
M440    | ~8 seconds         | Regex extraction
TMO-H   | ~15 seconds        | AI processing
Hitomi  | ~60 seconds        | Page-by-page browser
Times are estimates and depend on network speed and image sizes.

Resource usage

  • CPU: Minimal (no AI or JS execution)
  • Memory: Low (~50MB per chapter)
  • Network: Only downloads actual images, no extra requests
  • API calls: None required

Known limitations

Single chapter only

H2R handler does not support automatic series detection or bulk downloads. You must provide individual chapter URLs.

No lazy loading

Since extraction happens server-side (JSON parsing), there’s no need for lazy loading scripts or browser automation.

Escape sequences

The handler unescapes forward slashes in the JSON paths:
paths = [p.replace('\\/', '/') for p in raw_paths]
Location: ~/workspace/source/core/sites/h2r.py:49 This ensures URLs are properly formatted.

Troubleshooting

"Chapter data not found"

This error means the gData variable wasn’t found in the HTML. Possible causes:
  1. Site structure changed
  2. URL is invalid (404 page)
  3. Page requires authentication
  4. JavaScript variable name changed
Solution: Inspect the page source and update the regex pattern.
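A quick way to test the pattern against a saved copy of the page source (`find_gdata` is a hypothetical diagnostic helper, not part of the handler):

```python
import re

GDATA_RE = re.compile(r'var gData\s*=\s*(\{.*?\});', re.DOTALL)

def find_gdata(html: str):
    # Return the raw gData blob if present, else None -- useful for
    # checking whether the variable still exists in the page source.
    m = GDATA_RE.search(html)
    return m.group(1) if m else None

# e.g. run against a page saved from the browser:
# with open("page.html", encoding="utf-8") as f:
#     print(find_gdata(f.read()) is not None)
```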

"Could not extract image list"

The gData variable was found, but the images array is missing:
images_match = re.search(r'[\'"]images[\'"]\s*:\s*\[(.*?)\]', json_str, re.DOTALL)
if images_match:
    # ... success ...
else:
    log_callback("[ERROR] Could not extract image list.")
Location: ~/workspace/source/core/sites/h2r.py:45-75 Solution: Check if the JSON structure has changed.

CDN errors (404 on images)

If the base CDN URL is wrong, image downloads will fail:
cdn_match = re.search(r'src=["\'](https://[^"/]+/hentai)/', html)
if cdn_match:
    base_url = cdn_match.group(1)
Location: ~/workspace/source/core/sites/h2r.py:52-54 Solution: Update the fallback URL in the code:
base_url = "https://static.hentai.direct/hentai"  # Update if CDN changes

Advantages over other methods

vs. Browser automation (Hitomi, NHentai)

  • 10x faster
  • No Playwright dependencies
  • No headless browser overhead
  • More reliable (no Cloudflare issues)

vs. AI extraction (ZonaTMO, TMO-H)

  • No API key required
  • No LLM costs
  • No parsing errors
  • Deterministic results

vs. Regex-only (M440)

  • Cleaner extraction (structured JSON)
  • More maintainable
  • Less prone to breaking on HTML changes

Code walkthrough

Here’s the complete extraction flow:
  1. Fetch page HTML
result = await crawler.arun(url=url, bypass_cache=True)
html = result.html
  2. Extract gData variable
gdata_match = re.search(r'var gData\s*=\s*(\{.*?\});', html, re.DOTALL)
json_str = gdata_match.group(1)
  3. Parse images array
images_match = re.search(r'[\'"]images[\'"]\s*:\s*\[(.*?)\]', json_str, re.DOTALL)
img_list_raw = images_match.group(1)
raw_paths = re.findall(r'["\']([^"\']+)["\']', img_list_raw)
  4. Construct URLs
image_urls = [f"{base_url}{p}" if not p.startswith("http") else p for p in paths]
  5. Generate PDF
await download_and_make_pdf(
    image_urls,
    pdf_name,
    config.HEADERS_H2R,
    log_callback,
    check_cancel,
    progress_callback,
    open_result=config.OPEN_RESULT_ON_FINISH
)
Location: Full flow in ~/workspace/source/core/sites/h2r.py:31-73

Next steps

M440

Compare with regex-based extraction

Hitomi.la

See why some sites need browser automation

Utils

Explore download_and_make_pdf

Configuration

Configure headers and output paths
