Every source in Lightnovel Crawler is a Python class that extends Crawler. The Crawler class provides HTTP utilities, HTML parsing, concurrency helpers, and the contract your implementation must fulfil.
Inheritance chain: Crawler → Scraper → TaskManager
Source: lncrawl/core/crawler.py

Class definition

from lncrawl.core.crawler import Crawler

class MyCrawler(Crawler):
    base_url = ["https://example.com/"]

Class-level attributes

Declare these directly on the class body. They are read by the source service at import time.
base_url
Union[str, List[str]]
required
One or more URLs that identify the site this crawler handles. The first entry becomes self.home_url at runtime. The source registry uses these to match an input URL to the correct crawler.
base_url = ["https://example.com/", "https://www.example.com/"]
has_manga
bool
default:"False"
Set to True if this source serves manga, manhua, or manhwa (image-based content rather than text).
has_mtl
bool
default:"False"
Set to True if this source serves machine-translated content.
language
str
default:""
BCP-47 language code for the source language (e.g. "en", "zh", "ja"). Leave empty for multi-language sources.
is_disabled
bool
default:"False"
When True, the source is excluded from normal use. Set this if a site has shut down or permanently broken.
disable_reason
Optional[str]
default:"None"
Human-readable explanation of why the source is disabled. Only meaningful when is_disabled is True.

Instance attributes

These attributes are initialized in __init__ and are available throughout the crawler’s lifetime. You must populate the required ones inside read_novel_info().

Novel metadata

  • novel_url (str) — The current input URL. Available in search_novel() and read_novel_info().
  • novel_title (str) — Required. The title of the novel.
  • novel_author (str) — Comma-separated list of author names.
  • novel_cover (Optional[str]) — Absolute URL to the cover image.
  • is_rtl (bool) — True for right-to-left text (e.g. Arabic, Hebrew). Default False.
  • novel_synopsis (str) — Synopsis or description of the novel.
  • novel_tags (List[str]) — Genre and tag names.

Content lists

  • volumes (List[Volume]) — Ordered list of volumes. Can remain empty if the site has no volume grouping.
  • chapters (List[Chapter]) — Required. Ordered list of all chapters.
Both volumes and chapters must contain 1-based id values. The chapters list must be populated or the crawl will produce no output.
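The 1-based id convention can be sketched with plain dicts standing in for the Chapter and Volume models; the 100-chapters-per-volume grouping below is an arbitrary assumption for illustration, useful when a site exposes no volume structure of its own:

```python
# Plain dicts stand in for lncrawl's Chapter/Volume models here;
# the 100-chapter volume size is an arbitrary assumption.
chapter_titles = ["Chapter %d" % n for n in range(1, 251)]

chapters = []
volume_ids = set()
for idx, title in enumerate(chapter_titles, start=1):  # ids are 1-based
    vol_id = 1 + (idx - 1) // 100                      # 1, 2, 3, ...
    volume_ids.add(vol_id)
    chapters.append({"id": idx, "volume": vol_id, "title": title})

volumes = [{"id": v} for v in sorted(volume_ids)]
```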

Abstract methods

You must implement both of these in every crawler.

read_novel_info()

def read_novel_info(self) -> None:
    """Get novel title, author, cover, volumes and chapters"""
Called once before downloading begins. By the time this method returns you must have set at minimum self.novel_title and self.chapters. Typical implementation:
def read_novel_info(self) -> None:
    soup = self.get_soup(self.novel_url)

    self.novel_title = soup.select_one("h1.title").text.strip()
    self.novel_cover = self.absolute_url(soup.select_one(".cover img")["src"])

    chap_id = 0
    for a in soup.select(".chapter-list a"):
        chap_id += 1
        self.chapters.append(Chapter(
            id=chap_id,
            title=a.text.strip(),
            url=self.absolute_url(a["href"]),
        ))

download_chapter_body(chapter)

def download_chapter_body(self, chapter: Chapter) -> str:
    """Download body of a single chapter and return as clean html format."""
Called once per chapter during the download phase. Must return valid HTML as a string. Use self.cleaner.extract_contents(tag) to strip ads and normalize markup.
chapter
Chapter
required
The chapter to download. Use chapter.url to fetch the page.
Returns: str — Clean HTML content of the chapter body. Typical implementation:
def download_chapter_body(self, chapter: Chapter) -> str:
    soup = self.get_soup(chapter.url)
    body = soup.select_one("div.chapter-content")
    return self.cleaner.extract_contents(body)

Optional methods

Override any of these to add capabilities or customise behaviour.

initialize()

def initialize(self) -> None:
    pass
Called once per session before any other method. Use it to configure self.cleaner, adjust headers, or set rate limits.
def initialize(self) -> None:
    self.cleaner.bad_tags.update(["aside", "figure"])
    self.cleaner.bad_css.update([".ad-slot", ".donation-box"])

login(username_or_email, password_or_token)

def login(self, username_or_email: str, password_or_token: str) -> None:
    pass
Called once before crawling when credentials are provided. Submit the login form or store an auth token for subsequent requests.
username_or_email
str
required
Username, email address, or API key depending on the site.
password_or_token
str
required
Password or authentication token.
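As a sketch, a token-based login might exchange the credentials once and attach the token to every later request. The /api/login endpoint and its JSON field names are assumptions for illustration, not part of lncrawl:

```python
def login(self, username_or_email: str, password_or_token: str) -> None:
    # Hypothetical token exchange; the endpoint and field names
    # are assumptions, not a real API.
    data = self.post_json(
        self.absolute_url("/api/login"),
        data={"email": username_or_email, "password": password_or_token},
    )
    # Attach the token to every subsequent request.
    self.set_header("Authorization", "Bearer " + data["token"])
```

For form-based sites, submit_form() with the login form's fields serves the same purpose.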

search_novel(query)

def search_novel(self, query: str) -> List[SearchResult]:
    """Gets a list of results matching the given query"""
    raise NotImplementedError()
Return up to ten SearchResult objects. Raise NotImplementedError (the default) if the site does not support search.
query
str
required
The search string entered by the user.
Returns: List[SearchResult]
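A sketch of a typical implementation follows; the /search path and CSS selectors are hypothetical, and a real source file would also import SearchResult from lncrawl.models:

```python
from urllib.parse import quote_plus

# A real source file would also do: from lncrawl.models import SearchResult

def search_novel(self, query: str):
    # Hypothetical search endpoint and selectors, for illustration only.
    soup = self.get_soup(self.absolute_url("/search?q=" + quote_plus(query)))
    results = []
    for item in soup.select(".result-item")[:10]:  # cap at ten results
        a = item.select_one("a.title")
        results.append(SearchResult(
            title=a.text.strip(),
            url=self.absolute_url(a["href"]),
            info=item.select_one(".meta").text.strip(),
        ))
    return results
```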

Inherited from Scraper

These methods are available on every Crawler instance. They handle HTTP, cookie management, and HTML parsing.

HTTP helpers

get_response(url, timeout, **kwargs)
method
Fetch a URL with a GET request and return the raw Response object.
response = self.get_response("https://example.com/page")
  • url (str) — Target URL.
  • timeout (float | Tuple[float, float], default (7, 301)) — Connect and read timeouts in seconds.
get_soup(url, headers, encoding, **kwargs)
method
Fetch a URL and return a PageSoup instance for HTML parsing.
soup = self.get_soup(self.novel_url)
title = soup.select_one("h1").text
  • url (str) — Target URL.
  • headers (MutableMapping, optional) — Extra request headers.
  • encoding (str, optional) — Force a specific character encoding.
Returns: PageSoup
get_json(url, headers, **kwargs)
method
Fetch a URL and return the parsed JSON response.
data = self.get_json("https://example.com/api/novel/123")
  • url (str) — Target URL.
  • headers (MutableMapping, optional) — Extra request headers.
Returns: Any (parsed JSON)
post_json(url, data, headers, **kwargs)
method
POST to a URL and return the parsed JSON response.
data = self.post_json(
    "https://example.com/api/chapters",
    data={"novel_id": "42"},
)
  • url (str) — Target URL.
  • data (MutableMapping | str | bytes, optional) — Request body.
  • headers (MutableMapping, optional) — Extra request headers.
Returns: Any (parsed JSON)
post_soup(url, data, headers, encoding, **kwargs)
method
POST to a URL and return a PageSoup of the response.
Returns: PageSoup
submit_form(url, data, multipart, headers, **kwargs)
method
Submit a form and return the raw Response.
  • url (str) — Form action URL.
  • data (MutableMapping | str | bytes, optional) — Form fields.
  • multipart (bool, default False) — Use multipart/form-data encoding.
Returns: Response

Utility helpers

absolute_url(url, page_url)
method
Resolve a relative URL against the current page or self.home_url.
full = self.absolute_url("/chapter/1")          # -> "https://example.com/chapter/1"
full = self.absolute_url("../other", page_url)  # resolves relative to page_url
  • url (Any) — The URL to resolve.
  • page_url (str, optional) — Base for relative resolution. Defaults to the last fetched soup URL.
Returns: str
make_soup(data, encoding)
method
Create a PageSoup from an already-fetched response, bytes, or string without making a new network request.
Returns: PageSoup
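This is useful when you need both the raw Response (for headers or cookies) and parsed HTML from a single fetch. A sketch:

```python
def read_novel_info(self):
    # Fetch once, then parse the same Response without re-downloading.
    response = self.get_response(self.novel_url)
    soup = self.make_soup(response)
    self.novel_title = soup.select_one("h1").text.strip()
```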
set_header(key, value)
method
Set a default header that is sent with every subsequent request.
self.set_header("X-Api-Key", "secret")
set_cookie(name, value)
method
Set a session cookie for subsequent requests.

Concurrency helpers (from TaskManager)

submit_task(fn, *args, **kwargs)
method
Submit a callable to the thread pool and return a Future. Use this inside read_novel_info() or download_chapter_body() to parallelise requests.
futures = [
    self.submit_task(self.get_soup, chapter.url)
    for chapter in self.chapters
]
self.resolve_futures(futures, desc="Chapters", unit="ch")
Returns: Future[T]
resolve_futures(futures, desc, unit, disable_bar, fail_fast)
method
Wait for a list of Future objects to complete and return their results. Shows a progress bar by default.
  • futures (Iterable[Future]) — Futures to wait on.
  • desc (str, optional) — Progress bar label.
  • unit (str, optional) — Progress bar unit name.
  • disable_bar (bool, default False) — Suppress the progress bar.
  • fail_fast (bool, default False) — Raise on the first error instead of continuing.
Returns: List
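Combined with submit_task(), this supports fetching a paginated table of contents in parallel. A sketch; the ?page= parameter, the five-page count, and the selectors are assumptions, and a real source would import Chapter from lncrawl.models:

```python
def read_novel_info(self):
    # Hypothetical: five listing pages fetched in parallel; results come
    # back from resolve_futures() in submission order, so chapter ids
    # stay sequential.
    futures = [
        self.submit_task(self.get_soup, "%s?page=%d" % (self.novel_url, page))
        for page in range(1, 6)
    ]
    for soup in self.resolve_futures(futures, desc="TOC", unit="page"):
        for a in soup.select(".chapter-list a"):
            self.chapters.append(Chapter(
                id=len(self.chapters) + 1,
                title=a.text.strip(),
                url=self.absolute_url(a["href"]),
            ))
```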

Data model classes

Import these from lncrawl.models.

Chapter

from lncrawl.models import Chapter

chapter = Chapter(
    id=1,
    title="Chapter 1: The Beginning",
    url="https://example.com/chapter/1",
    volume=1,
)
  • id (int) — Required. 1-based chapter index.
  • url (str) — URL where the chapter content is fetched from.
  • title (str) — Display title of the chapter.
  • volume (Optional[int]) — The id of the volume this chapter belongs to.
  • body (Optional[str]) — HTML content. Populated after download_chapter_body().
  • images (Dict[str, str]) — Map of image_id -> image_url for inline images.
  • success (bool) — True once the chapter has been downloaded successfully.

Volume

from lncrawl.models import Volume

volume = Volume(id=1, title="Volume I: Awakening")
  • id (int) — Required. 1-based volume index.
  • title (str) — Display title of the volume.
  • chapter_count (int) — Number of chapters. Computed automatically.

SearchResult

from lncrawl.models import SearchResult

result = SearchResult(
    title="The Legendary Hero",
    url="https://example.com/novel/the-legendary-hero",
    info="Chapters: 342 | Rating: 4.8",
)
  • title (str) — Required. Novel title.
  • url (str) — Required. URL to the novel’s table-of-contents page.
  • info (str) — Short metadata string shown in search results (e.g. chapter count, rating).

TextCleaner

Access via self.cleaner. Strips ads, normalises HTML, and returns clean paragraph-wrapped content.
body_html = self.cleaner.extract_contents(tag)  # returns clean <p>...</p> HTML
Customise in initialize():
  • bad_tags (Set[str]) — HTML tag names to remove entirely (e.g. "aside", "figure").
  • bad_css (Set[str]) — CSS selectors whose matching elements are removed (e.g. ".ads").
  • bad_text_regex (Set[str | Pattern]) — Regex patterns; paragraphs matching any pattern are dropped.
  • substitutions (Dict[str, str]) — String replacements applied to text content.
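All four attributes can be tuned together in initialize(). The values below are illustrative examples of site-specific tuning, not defaults:

```python
def initialize(self) -> None:
    # Illustrative site-specific tuning; adjust per source.
    self.cleaner.bad_tags.update(["script", "iframe"])
    self.cleaner.bad_css.update([".ads", ".share-bar"])
    self.cleaner.bad_text_regex.update([r"^Translator:"])
    self.cleaner.substitutions.update({"［": "[", "］": "]"})
```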

Minimal example

import logging
from lncrawl.core.crawler import Crawler
from lncrawl.models import Chapter

logger = logging.getLogger(__name__)


class ExampleCrawler(Crawler):
    base_url = ["https://example.com/"]

    def read_novel_info(self) -> None:
        soup = self.get_soup(self.novel_url)
        self.novel_title = soup.select_one("h1.novel-title").text.strip()
        self.novel_cover = self.absolute_url(
            soup.select_one(".cover img")["src"]
        )
        chap_id = 0
        for a in soup.select(".chapter-list li a"):
            chap_id += 1
            self.chapters.append(Chapter(
                id=chap_id,
                title=a.text.strip(),
                url=self.absolute_url(a["href"]),
            ))

    def download_chapter_body(self, chapter: Chapter) -> str:
        soup = self.get_soup(chapter.url)
        body = soup.select_one("div.chapter-text")
        return self.cleaner.extract_contents(body)
