Crawler. The Crawler class provides HTTP utilities, HTML parsing, concurrency helpers, and the contract your implementation must fulfil.
Inheritance chain: Crawler → Scraper → TaskManager
Source: lncrawl/core/crawler.py
Class definition
Class-level attributes
Declare these directly on the class body. They are read by the source service at import time.

| Attribute | Type | Description |
|---|---|---|
| base_url | List[str] | One or more URLs that identify the site this crawler handles. The first entry becomes self.home_url at runtime. The source registry uses these to match an input URL to the correct crawler. |
| has_manga | bool | Set to True if this source serves manga, manhua, or manhwa (image-based content rather than text). |
| has_mtl | bool | Set to True if this source serves machine-translated content. |
| language | str | BCP-47 language code for the source language (e.g. "en", "zh", "ja"). Leave empty for multi-language sources. |
| is_disabled | bool | When True, the source is excluded from normal use. Set this if a site has shut down or is permanently broken. |
| disable_reason | str | Human-readable explanation of why the source is disabled. Only meaningful when is_disabled is True. |

Instance attributes
These attributes are initialized in __init__ and are available throughout the crawler’s lifetime. You must populate the required ones inside read_novel_info().
Novel metadata
| Attribute | Type | Description |
|---|---|---|
| novel_url | str | The current input URL. Available in search_novel() and read_novel_info(). |
| novel_title | str | Required. The title of the novel. |
| novel_author | str | Comma-separated list of author names. |
| novel_cover | Optional[str] | Absolute URL to the cover image. |
| is_rtl | bool | True for right-to-left text (e.g. Arabic, Hebrew). Default False. |
| novel_synopsis | str | Synopsis or description of the novel. |
| novel_tags | List[str] | Genre and tag names. |
Content lists
| Attribute | Type | Description |
|---|---|---|
| volumes | List[Volume] | Ordered list of volumes. Can remain empty if the site has no volume grouping. |
| chapters | List[Chapter] | Required. Ordered list of all chapters. |
Both volumes and chapters must contain 1-based id values. The chapters list must be populated, or the crawl will produce no output.

Abstract methods
You must implement both of these in every crawler.

read_novel_info()
Fetch the page at self.novel_url, then populate the instance attributes above, at minimum self.novel_title and self.chapters.
Typical implementation:
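A hedged sketch of the usual shape. The Chapter stand-in and the pretend parse result are illustrative only; a real crawler inherits from Crawler and would call self.get_soup(self.novel_url) and select elements from the page with CSS selectors.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Chapter:  # minimal stand-in for lncrawl.models.Chapter
    id: int
    url: str
    title: str
    volume: Optional[int] = None

class DemoCrawler:  # stand-in for a Crawler subclass
    def __init__(self, novel_url: str):
        self.novel_url = novel_url
        self.novel_title = ""
        self.novel_author = ""
        self.chapters: List[Chapter] = []

    def read_novel_info(self) -> None:
        # Real code would fetch and parse the page, e.g.:
        #   soup = self.get_soup(self.novel_url)
        #   title = soup.select_one("h1.novel-title").text.strip()
        parsed = {  # pretend result of parsing the table of contents
            "title": "Example Novel",
            "author": "Jane Doe",
            "chapter_links": ["/novel/1/ch-1", "/novel/1/ch-2"],
        }
        self.novel_title = parsed["title"]
        self.novel_author = parsed["author"]
        for idx, href in enumerate(parsed["chapter_links"], start=1):
            self.chapters.append(  # ids must be 1-based
                Chapter(id=idx, url="https://example.com" + href,
                        title=f"Chapter {idx}")
            )

crawler = DemoCrawler("https://example.com/novel/1")
crawler.read_novel_info()
```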
download_chapter_body(chapter)
Fetch a single chapter page and return its cleaned body. Pass the content tag through self.cleaner.extract_contents(tag) to strip ads and normalize markup.

chapter (Chapter) — The chapter to download. Use chapter.url to fetch the page.

Returns: str — Clean HTML content of the chapter body.
Typical implementation:
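A hedged sketch of the usual shape. The fetch and the cleaner are stubbed with stdlib code so the sketch runs on its own; a real crawler would call self.get_soup(chapter.url) and self.cleaner.extract_contents(...) instead.

```python
import re

class DemoCleaner:
    """Crude stand-in for TextCleaner.extract_contents."""
    def extract_contents(self, html: str) -> str:
        return re.sub(r"\s+", " ", html).strip()  # collapse whitespace runs

class DemoCrawler:
    cleaner = DemoCleaner()

    def fetch_html(self, url: str) -> str:
        # Real code: soup = self.get_soup(url)
        #            body = soup.select_one("div.chapter-content")
        return "<p>First   paragraph.</p>\n<p>Second.</p>"  # pretend fetch

    def download_chapter_body(self, chapter: dict) -> str:
        html = self.fetch_html(chapter["url"])
        return self.cleaner.extract_contents(html)

body = DemoCrawler().download_chapter_body({"url": "https://example.com/ch-1"})
```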
Optional methods
Override any of these to add capabilities or customise behaviour.

initialize()
Called once before crawling begins. Use it to configure self.cleaner, adjust headers, or set rate limits.
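A minimal sketch of an initialize() override that sets custom headers. The set_header stub mirrors the helper described in the Scraper section; the header values are invented for illustration.

```python
class DemoCrawler:
    """Stand-in; a real crawler inherits set_header from Scraper."""
    def __init__(self):
        self.headers = {}

    def set_header(self, key: str, value: str) -> None:
        self.headers[key] = value  # sent with every subsequent request

    def initialize(self) -> None:
        # Typical setup work: custom headers, cleaner tweaks, rate limits.
        self.set_header("User-Agent", "Mozilla/5.0 (compatible; lncrawl)")
        self.set_header("Referer", "https://example.com/")

c = DemoCrawler()
c.initialize()
```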
login(username_or_email, password_or_token)
Authenticate with the site before crawling begins.

username_or_email (str) — Username, email address, or API key depending on the site.
password_or_token (str) — Password or authentication token.
search_novel(query)
Search the site and return a list of SearchResult objects. Raise NotImplementedError (the default) if the site does not support search.

query (str) — The search string entered by the user.

Returns: List[SearchResult]
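A hedged sketch of a search_novel() implementation. The SearchResult stand-in mirrors the fields of lncrawl.models.SearchResult, and the parse result is invented; a real crawler would fetch the search page with self.get_soup(...).

```python
from dataclasses import dataclass

@dataclass
class SearchResult:  # minimal stand-in for lncrawl.models.SearchResult
    title: str
    url: str
    info: str = ""

class DemoCrawler:
    home_url = "https://example.com/"

    def search_novel(self, query: str):
        # Real code would fetch and parse a search page, e.g.:
        #   soup = self.get_soup(self.home_url + "search?q=" + query)
        rows = [("Alpha Novel", "/novel/alpha", "120 chapters")]  # pretend parse
        return [
            SearchResult(title=t, url=self.home_url.rstrip("/") + u, info=i)
            for (t, u, i) in rows
        ]

results = DemoCrawler().search_novel("alpha")
```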
Inherited from Scraper
These methods are available on every Crawler instance. They handle HTTP, cookie management, and HTML parsing.
HTTP helpers
get_response(url, timeout=(7, 301))
Fetch a URL with a GET request and return the raw Response object.

url (str) — Target URL.
timeout (float | Tuple[float, float], default (7, 301)) — Connect and read timeouts in seconds.
get_soup(url, headers=None, encoding=None)
Fetch a URL and return a PageSoup instance for HTML parsing.

url (str) — Target URL.
headers (MutableMapping, optional) — Extra request headers.
encoding (str, optional) — Force a specific character encoding.

Returns: PageSoup

get_json(url, headers=None)
Fetch a URL and return the parsed JSON response.

url (str) — Target URL.
headers (MutableMapping, optional) — Extra request headers.

Returns: Any (parsed JSON)

post_json(url, data=None, headers=None)
POST to a URL and return the parsed JSON response.

url (str) — Target URL.
data (MutableMapping | str | bytes, optional) — Request body.
headers (MutableMapping, optional) — Extra request headers.

Returns: Any (parsed JSON)

post_soup(url, data=None, headers=None)
POST to a URL and return a PageSoup of the response.

Returns: PageSoup

submit_form(url, data=None, multipart=False)
Submit a form and return the raw Response.

url (str) — Form action URL.
data (MutableMapping | str | bytes, optional) — Form fields.
multipart (bool, default False) — Use multipart/form-data encoding.

Returns: Response

Utility helpers
absolute_url(url, page_url=None)
Resolve a relative URL against the current page or self.home_url.

url (Any) — The URL to resolve.
page_url (str, optional) — Base for relative resolution. Defaults to the last fetched soup URL.

Returns: str

make_soup(data)
Create a PageSoup from an already-fetched response, bytes, or string without making a new network request.

Returns: PageSoup

set_header(key, value)
Set a default header that is sent with every subsequent request.

set_cookie(name, value)
Set a session cookie for subsequent requests.
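The resolution rule behind a helper like absolute_url can be seen with the stdlib's urljoin, which applies the standard relative-reference logic (this sketch uses urljoin directly rather than the crawler helper):

```python
from urllib.parse import urljoin

home_url = "https://example.com/novel/123/"

# Relative path: resolved against the current page.
ch = urljoin(home_url, "chapter-1")
# Root-relative path: resolved against the site root.
search = urljoin(home_url, "/search")
# Absolute URL: returned unchanged.
other = urljoin(home_url, "https://cdn.example.com/cover.jpg")
```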
Concurrency helpers (from TaskManager)
submit_task(fn, *args, **kwargs)
Submit a callable to the thread pool and return a Future. Use this inside read_novel_info() or download_chapter_body() to parallelise requests.

Returns: Future[T]

resolve_futures(futures, desc=None, unit=None, disable_bar=False, fail_fast=False)
Wait for a list of Future objects to complete and return their results. Shows a progress bar by default.

futures (Iterable[Future]) — Futures to wait on.
desc (str, optional) — Progress bar label.
unit (str, optional) — Progress bar unit name.
disable_bar (bool, default False) — Suppress the progress bar.
fail_fast (bool, default False) — Raise on the first error instead of continuing.

Returns: List

Data model classes
Import these from lncrawl.models.
Chapter
| Field | Type | Description |
|---|---|---|
| id | int | Required. 1-based chapter index. |
| url | str | URL where the chapter content is fetched from. |
| title | str | Display title of the chapter. |
| volume | Optional[int] | The id of the volume this chapter belongs to. |
| body | Optional[str] | HTML content. Populated after download_chapter_body(). |
| images | Dict[str, str] | Map of image_id -> image_url for inline images. |
| success | bool | True once the chapter has been downloaded successfully. |
Volume
| Field | Type | Description |
|---|---|---|
| id | int | Required. 1-based volume index. |
| title | str | Display title of the volume. |
| chapter_count | int | Number of chapters. Computed automatically. |
SearchResult
| Field | Type | Description |
|---|---|---|
| title | str | Required. Novel title. |
| url | str | Required. URL to the novel’s table-of-contents page. |
| info | str | Short metadata string shown in search results (e.g. chapter count, rating). |
TextCleaner
Access via self.cleaner. Strips ads, normalises HTML, and returns clean paragraph-wrapped content.
Configure the following attributes, typically from initialize():
| Attribute | Type | Description |
|---|---|---|
| bad_tags | Set[str] | HTML tag names to remove entirely (e.g. "aside", "figure"). |
| bad_css | Set[str] | CSS selectors whose matching elements are removed (e.g. ".ads"). |
| bad_text_regex | Set[str \| Pattern] | Regex patterns — paragraphs matching any pattern are dropped. |
| substitutions | Dict[str, str] | String replacements applied to text content. |
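A sketch of setting these attributes, typically from initialize(). The cleaner here is a stub exposing the same attribute names, and the specific tags, selectors, and patterns are invented examples.

```python
import re

class DemoCleaner:  # stand-in exposing the TextCleaner attributes
    def __init__(self):
        self.bad_tags = set()
        self.bad_css = set()
        self.bad_text_regex = set()
        self.substitutions = {}

class DemoCrawler:
    def __init__(self):
        self.cleaner = DemoCleaner()

    def initialize(self) -> None:
        # Remove whole tags and ad containers.
        self.cleaner.bad_tags.update({"aside", "figure", "script"})
        self.cleaner.bad_css.update({".ads", "#comments"})
        # Drop watermark paragraphs and fix site-specific text.
        self.cleaner.bad_text_regex.add(re.compile(r"read this novel on \S+", re.I))
        self.cleaner.substitutions["Examp1e.com"] = ""

c = DemoCrawler()
c.initialize()
```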