The Hitomi.la handler uses advanced browser automation with Playwright to bypass anti-bot protections and download high-quality images page-by-page.

Supported domains

@staticmethod
def get_supported_domains() -> list:
    return ["hitomi.la"]
Location: ~/workspace/source/core/sites/hitomi.py:19-21

Supported URLs

Hitomi URLs come in two formats:
https://hitomi.la/reader/[gallery-id].html
https://hitomi.la/[type]/[title-slug]-[gallery-id].html
Types: doujinshi, manga, galleries, cg, etc.

The handler extracts the numeric ID from either format:
id_match = re.search(r'[-/](\d+)\.html', url)
if not id_match:
    log_callback("[ERROR] Could not extract ID from URL.")
    return
gallery_id = int(id_match.group(1))
Location: ~/workspace/source/core/sites/hitomi.py:34-37
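The pattern can be sanity-checked against both URL shapes with a quick standalone sketch (extract_gallery_id is an illustrative wrapper, not a function from the handler):

```python
import re

def extract_gallery_id(url: str):
    """Pull the numeric gallery ID out of either Hitomi URL format."""
    id_match = re.search(r'[-/](\d+)\.html', url)
    return int(id_match.group(1)) if id_match else None

# Both formats resolve to the same ID.
print(extract_gallery_id("https://hitomi.la/reader/1234567.html"))                        # 1234567
print(extract_gallery_id("https://hitomi.la/doujinshi/my-favorite-doujin-1234567.html"))  # 1234567
```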

Why browser automation?

Hitomi.la implements aggressive anti-bot protection:
  • 403 Forbidden on direct image requests without proper headers
  • 404 Not Found if referrer is missing or incorrect
  • Dynamic image URLs that change based on browser state
  • JavaScript-required reader interface
Standard HTTP requests and Crawl4AI fail on Hitomi. Only full browser automation works reliably.

Extraction technology

Playwright stealth mode

The handler uses Playwright with stealth techniques:
from playwright.async_api import async_playwright

args = [
    "--no-sandbox", 
    "--disable-setuid-sandbox", 
    "--start-maximized"
]

browser = await p.chromium.launch(
    headless=is_headless, 
    args=args
)

context = await browser.new_context(
    user_agent=config.USER_AGENT,
    viewport={'width': 1280, 'height': 720}
)
Location: ~/workspace/source/core/sites/hitomi.py:57-64

Headless detection

The handler automatically determines if it should run headless:
is_headless = os.getenv("HEADLESS", "false").lower() == "true" or not os.getenv("DISPLAY")
if os.name == 'nt':  # Windows
    is_headless = False
Location: ~/workspace/source/core/sites/hitomi.py:53-54
  • Linux servers: Defaults to headless
  • Windows: Always visible (better success rate)
  • Override: Set HEADLESS=true in .env
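The decision above can be written as a pure function for illustration (a sketch; resolve_headless is not a name from the handler, which reads the environment directly):

```python
def resolve_headless(env: dict, os_name: str, has_display: bool) -> bool:
    # Mirrors the handler's logic: headless on servers without a display
    # or when HEADLESS=true, but never on Windows.
    headless = env.get("HEADLESS", "false").lower() == "true" or not has_display
    if os_name == "nt":  # Windows
        headless = False
    return headless
```

Factoring it out this way also makes the precedence explicit: the Windows check overrides everything, including an explicit HEADLESS=true.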

Page-by-page extraction

Unlike other handlers that extract all URLs at once, the Hitomi handler downloads images one by one:
reader_url = f"https://hitomi.la/reader/{gallery_id}.html#1"
await page.goto(reader_url, wait_until="domcontentloaded")

# Get total images from JavaScript variable
total_images = await page.evaluate(
    "() => window.galleryinfo ? window.galleryinfo.files.length : 0"
)

for i in range(1, total_images + 1):
    # Update hash to go to next image
    await page.evaluate(f"location.hash = '#{i}'")
    
    # Wait for image to update
    await page.wait_for_function(
        """(selector) => {
            const img = document.querySelector(selector);
            return img && img.src && img.src.indexOf('http') === 0;
        }""", 
        arg="div#comicImages img", 
        timeout=10000
    )
    
    # Extract image info
    img_info = await page.evaluate("""(selector) => {
        const img = document.querySelector(selector);
        return {
            src: img.src, 
            width: img.naturalWidth, 
            height: img.naturalHeight
        };
    }""", "div#comicImages img")
Location: ~/workspace/source/core/sites/hitomi.py:69-122

Why page-by-page?

  1. Dynamic URLs: Image URLs are generated per page via JavaScript
  2. Protection: Bulk requests trigger rate limiting
  3. Quality: Ensures high-quality images (not thumbnails)
  4. Reliability: Handles navigation errors per-page

Downloading with proper referrer

Hitomi requires the reader URL as referrer:
img_src = img_info['src']
headers = {"Referer": f"https://hitomi.la/reader/{gallery_id}.html"}

response = await page.request.get(img_src, headers=headers)

if response.status == 200:
    data = await response.body()
    ext = img_src.split('.')[-1]
    if '?' in ext: 
        ext = ext.split('?')[0]
    
    filename = f"{i:03d}.{ext}"
    filepath = os.path.join(temp_dir, filename)
    
    with open(filepath, 'wb') as f:
        f.write(data)
    
    download_targets.append(filepath)
Location: ~/workspace/source/core/sites/hitomi.py:124-141

Using page.request.get() instead of aiohttp ensures:
  • Cookies are maintained
  • Browser context is used
  • Referrer is properly set
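One fragile spot above is deriving the file extension from the URL. The split('.') approach works, but a hypothetical helper (not part of the handler) using urllib.parse sidesteps query strings entirely:

```python
import os
from urllib.parse import urlsplit

def guess_extension(img_src: str, default: str = "jpg") -> str:
    # Take the extension from the URL path only, so query strings like
    # "?token=abc" can never leak into the saved filename.
    path = urlsplit(img_src).path
    ext = os.path.splitext(path)[1].lstrip(".")
    return ext or default
```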

Temporary file handling

Since images are downloaded sequentially, they’re stored in a temp folder first:
current_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
temp_dir = os.path.join(current_dir, config.TEMP_FOLDER_NAME)

if os.path.exists(temp_dir):
    shutil.rmtree(temp_dir)
os.makedirs(temp_dir, exist_ok=True)

download_targets = []  # Will store file paths
Location: ~/workspace/source/core/sites/hitomi.py:43-49

After all downloads complete:
if download_targets:
    pdf_name = f"{clean_filename(title)}.pdf"
    finalize_pdf_flow(
        download_targets, 
        pdf_name, 
        log_callback, 
        temp_dir,
        open_result=config.OPEN_RESULT_ON_FINISH
    )
Location: ~/workspace/source/core/sites/hitomi.py:164-172

The finalize_pdf_flow utility:
  1. Generates the PDF from downloaded files
  2. Cleans up the temp directory
  3. Opens the result if configured
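The zero-padded filenames written earlier (f"{i:03d}.{ext}") matter at this step: a plain lexicographic sort keeps pages in reading order when the PDF is assembled.

```python
# Without padding, "10.jpg" would sort before "2.jpg"; with {i:03d} it doesn't.
names = [f"{i:03d}.jpg" for i in (10, 2, 1)]
print(sorted(names))  # ['001.jpg', '002.jpg', '010.jpg']
```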

Title extraction

The handler extracts the gallery title from the page:
title = f"Hitomi_{gallery_id}"
page_title = await page.title()

if page_title:
    clean_title = re.sub(r'[\\/*?:"<>|]', '', page_title).strip()
    title = clean_title if clean_title else title

log_callback(f"[INFO] Title detected: {title}")
Location: ~/workspace/source/core/sites/hitomi.py:77-81

If title extraction fails, it defaults to Hitomi_{gallery_id}.

Fallback for missing galleryinfo

Sometimes window.galleryinfo is not immediately available:
total_images = await page.evaluate(
    "() => window.galleryinfo ? window.galleryinfo.files.length : 0"
)

if total_images == 0:
    log_callback("[INFO] 'galleryinfo' not detected, trying fallback...")
    try:
        await page.wait_for_function(
            "() => window.galleryinfo && window.galleryinfo.files.length > 0", 
            timeout=5000
        )
        total_images = await page.evaluate("() => window.galleryinfo.files.length")
    except Exception:
        log_callback("[WARN] Could not determine total images. Estimating...")
        total_images = 9999  # Arbitrary limit
Location: ~/workspace/source/core/sites/hitomi.py:84-94

If the fallback also fails, the handler assumes up to 9999 pages and stops when errors occur:
except Exception as e:
    log_callback(f"[ERROR] Error on page {i}: {e}")
    if total_images == 9999 and i > 5:
        log_callback("[INFO] Possible end of gallery.")
        break
Location: ~/workspace/source/core/sites/hitomi.py:150-154
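The pattern boils down to: iterate toward a large cap and treat an error past the first few pages as the end of the gallery. A standalone sketch (walk_pages is illustrative, not a handler function):

```python
def walk_pages(fetch, cap=9999, grace=5):
    # Try each page up to the cap; an error after the grace window is
    # treated as the end of the gallery rather than a fatal failure.
    pages = []
    for i in range(1, cap + 1):
        try:
            pages.append(fetch(i))
        except Exception:
            if i > grace:
                break  # probable end of gallery
    return pages
```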

Rate limiting and delays

The handler includes a 500ms delay between pages:
await page.wait_for_timeout(500)
Location: ~/workspace/source/core/sites/hitomi.py:148

This prevents:
  • Rate limiting by the server
  • Browser detection as bot
  • Connection errors
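A possible refinement (not in the handler) is adding jitter, so the inter-page timing looks less mechanical:

```python
import random

def next_delay_ms(base: int = 500, jitter: int = 250) -> int:
    # Fixed base plus a random component, e.g. 500-750 ms between pages.
    return base + random.randint(0, jitter)
```

The fixed await page.wait_for_timeout(500) would then become page.wait_for_timeout(next_delay_ms()).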

Usage examples

from core.handler import process_url

await process_url(
    "https://hitomi.la/reader/1234567.html",
    log_callback=print,
    check_cancel=lambda: False,
    progress_callback=lambda current, total: print(f"Page {current}/{total}")
)
Output: PDF/Gallery Title.pdf
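The check_cancel callable can be anything that returns a bool; one simple wiring (an assumption, not code from the project) uses threading.Event so another thread or a UI button can stop the run:

```python
import threading

cancel_event = threading.Event()

# Pass cancel_event.is_set as check_cancel; setting the event from another
# thread makes subsequent checks return True, stopping the download loop.
check_cancel = cancel_event.is_set
```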

From info page

await process_url(
    "https://hitomi.la/doujinshi/my-favorite-doujin-1234567.html",
    log_callback=print,
    check_cancel=lambda: False
)
The handler extracts the ID 1234567 and navigates to the reader.

Via web interface

  1. Launch: START_WEB_VERSION.bat
  2. Open: http://localhost:3000
  3. Paste URL: https://hitomi.la/reader/1234567.html
  4. Watch real-time page-by-page progress
Hitomi downloads are slower due to page-by-page navigation, but this ensures maximum quality and bypasses protections.

Performance characteristics

  • Speed: Slow (~30-60 seconds for 50 images)
  • Reliability: Very high (bypasses all protections)
  • Resource usage: High (full browser instance)
  • Quality: Maximum (original resolution)

Comparison

| Metric         | Hitomi | H2R   | M440  |
|----------------|--------|-------|-------|
| Time (50 imgs) | ~60s   | ~5s   | ~8s   |
| Memory         | ~500MB | ~50MB | ~80MB |
| CPU            | High   | Low   | Low   |
| Reliability    | 99%    | 95%   | 90%   |

Known limitations

No series support

Hitomi handler only supports single galleries, not bulk series downloads.

Visible browser on Windows

For best reliability on Windows, the browser runs in visible mode:
if os.name == 'nt':  # Windows
    is_headless = False
This can be distracting but significantly improves success rate.

Slow for large galleries

Galleries with 200+ images can take 5-10 minutes. Consider using progress callbacks:
progress_callback=lambda c, t: print(f"Progress: {c}/{t} ({c/t*100:.1f}%)", end="\r")

Troubleshooting

Browser launch fails

If Playwright fails to launch:
playwright install chromium
Ensure Chromium is installed.

Timeout on wait_for_function

If image loading times out:
await page.wait_for_function(..., timeout=10000)  # Increase timeout
Adjust the timeout in the code (currently 10 seconds).

403/404 errors

If downloads fail with 403/404:
  1. Verify referrer is set correctly
  2. Check if User-Agent is up-to-date
  3. Try visible browser mode (HEADLESS=false)
  4. Increase delays between requests

Images not loading

If window.galleryinfo is undefined:
await page.wait_for_timeout(2000)  # Wait longer
Increase initial wait time after page load.

Advanced configuration

Custom viewport

You can modify the viewport size:
context = await browser.new_context(
    user_agent=config.USER_AGENT,
    viewport={'width': 1920, 'height': 1080}  # Larger viewport
)
Location: ~/workspace/source/core/sites/hitomi.py:61-64

Additional stealth options

For extra stealth, add more browser args:
args = [
    "--no-sandbox",
    "--disable-setuid-sandbox",
    "--start-maximized",
    "--disable-blink-features=AutomationControlled",
    "--disable-dev-shm-usage"
]

Next steps

NHentai

Similar browser-based approach for NHentai

Configuration

Configure headless mode and paths

Utils

Learn about finalize_pdf_flow

Architecture

Understand the handler system
