A source crawler is a small Python file that tells Lightnovel Crawler how to read a specific novel website. It answers two questions:
  1. What’s on the novel page? — title, author, cover, list of chapters.
  2. What’s inside each chapter? — the story text, cleaned of ads and navigation.
Once you add a crawler, users can download novels from that site through the CLI or the web UI.

Prerequisites

  • Python 3.9+
  • The project set up locally (see Development setup)
  • Basic familiarity with HTML and CSS selectors — you’ll use selectors like div.chapter-content to target elements

Architecture overview

Crawlers live in the sources/ directory, organized by language. Each crawler is a single .py file containing one class that inherits from a template base class. The template handles HTTP requests, concurrency, and output building — you only implement the methods that extract data from the page HTML. The crawler registry auto-discovers all Python files in sources/ on startup. A crawler is matched to a URL when the URL starts with one of the values in the class’s base_url list.
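In sketch form, the matching step amounts to a prefix check. The names below are illustrative only (the real registry lives inside lncrawl's core), but they show the idea:

```python
# Illustrative sketch of prefix-based crawler matching.
# REGISTRY and find_crawler are hypothetical names, not the real API.
REGISTRY = {
    "https://mysite.com/": "MySiteCrawler",
    "https://www.mysite.com/": "MySiteCrawler",
}

def find_crawler(url: str):
    """Return the crawler whose base_url prefix matches the given URL."""
    matches = [base for base in REGISTRY if url.startswith(base)]
    # Prefer the longest matching prefix if several apply.
    return REGISTRY[max(matches, key=len)] if matches else None
```

This is why base_url values should include the trailing slash and every hostname variant users might paste.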

Choose a template

Pick the template that matches your site’s structure. Start with GeneralSoupTemplate for most sites and only move to a more specific template if you need its extra features.

File placement

Crawlers are grouped by the site’s language. Place your file in the matching directory:
| Site language | Folder | Example path |
| --- | --- | --- |
| English | sources/en/, then by first letter | sources/en/m/mysite.py |
| Chinese | sources/zh/ | sources/zh/mysite.py |
| Japanese | sources/ja/ | sources/ja/mysite.py |
| Multiple languages | sources/multi/ | sources/multi/mysite.py |
For English sites, use a letter subfolder based on the site’s domain name (e.g. “mynovelsite.com” → sources/en/m/). Name the file after the site, such as mynovelsite.py. Avoid generic names like crawler.py.

Step-by-step: build a crawler

Step 1: Copy the example file

Copy the appropriate example into the right sources/{lang}/ folder:
cp sources/_examples/_01_general_soup.py sources/en/m/mysite.py
For English sites, use the letter subfolder matching your domain’s first letter.
Step 2: Set base_url and rename the class

Open your new file and update the class name and base_url:
class MySiteCrawler(GeneralSoupTemplate):
    base_url = ["https://mysite.com/", "https://www.mysite.com/"]
base_url is a list of URL prefixes. When a user pastes a novel URL, the app finds your crawler by checking if the URL starts with one of these values.
Step 3: Implement parse_title

Open the novel page in your browser, right-click the title, and select Inspect. Find the CSS selector that uniquely identifies the title element.
def parse_title(self, soup: PageSoup) -> str:
    tag = soup.select_one("h1.novel-title")  # adjust selector to match site
    return tag.get_text(strip=True) if tag else ""
The soup parameter is the parsed HTML of the novel’s detail page.
Step 4: Implement parse_cover

Find the cover image element and return its URL. Use self.absolute_url() to handle relative paths.
def parse_cover(self, soup: PageSoup) -> str:
    img = soup.select_one("img.cover")  # adjust selector to match site
    if img and img.get("src"):
        return self.absolute_url(img["src"])
    return ""
Return an empty string if the site has no cover image.
Step 5: Implement parse_chapter_list

Yield Volume and Chapter objects in order. The template appends them to self.volumes and self.chapters.
Flat chapter list (no explicit volumes):
from lncrawl.models import Chapter, Volume

def parse_chapter_list(
    self, soup: PageSoup
) -> Generator[Union[Chapter, Volume], None, None]:
    yield Volume(id=1, title="Volume 1")

    for idx, a in enumerate(soup.select("ul.chapters a"), 1):
        yield Chapter(
            id=idx,
            title=a.get_text(strip=True),
            url=self.absolute_url(a["href"]),
            volume=1,
        )
With volume headings:
def parse_chapter_list(
    self, soup: PageSoup
) -> Generator[Union[Chapter, Volume], None, None]:
    chap_id = 1
    for vol_id, vol_tag in enumerate(soup.select(".volume-block"), 1):
        yield Volume(id=vol_id, title=vol_tag.select_one(".vol-title").get_text(strip=True))
        for a in vol_tag.select("a.chapter-link"):
            yield Chapter(
                id=chap_id,
                title=a.get_text(strip=True),
                url=self.absolute_url(a["href"]),
                volume=vol_id,
            )
            chap_id += 1
Always use self.absolute_url() on chapter URLs — some sites use relative paths.
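The resolution behaves much like urllib.parse.urljoin. This standalone sketch (with made-up URLs) shows what happens to each kind of link; it illustrates the semantics, not the helper's actual implementation:

```python
from urllib.parse import urljoin

page_url = "https://mysite.com/novel/example/"

# Root-relative paths resolve against the site root:
assert urljoin(page_url, "/chapter/1") == "https://mysite.com/chapter/1"
# Page-relative paths resolve against the current page:
assert urljoin(page_url, "ch-2.html") == "https://mysite.com/novel/example/ch-2.html"
# Already-absolute URLs pass through unchanged:
assert urljoin(page_url, "https://cdn.mysite.com/cover.jpg") == "https://cdn.mysite.com/cover.jpg"
```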
Step 6: Implement select_chapter_body

The template fetches each chapter page and calls this method with its parsed HTML. Return the single tag that wraps the story text.
def select_chapter_body(self, soup: PageSoup) -> PageSoup:
    return soup.select_one("div.chapter-content")  # adjust selector
The template extracts and cleans the content automatically. Return None if not found.
Step 7: Test with a real novel URL

Run a quick download test from the project root. --first 3 downloads only the first 3 chapters; -f json outputs as JSON:
uv run python -m lncrawl -s "https://mysite.com/novel/example" --first 3 -f json
If you see errors about selectors or “element not found”, your CSS selectors don’t match the site — use the browser’s Inspect tool to find the correct class names.
Step 8: Verify the crawler is registered

Confirm your file appears in the source list:
uv run python -m lncrawl sources list | grep mysite
If nothing appears, check that the file is in the right sources/ folder and that the class inherits from Crawler or a template.

Required method signatures

These are the four methods every GeneralSoupTemplate crawler must implement:
def parse_title(self, soup: PageSoup) -> str:
    """Return the novel title from the novel detail page."""
    ...

def parse_cover(self, soup: PageSoup) -> str:
    """Return the cover image URL, or '' if none.
    Use self.absolute_url() for relative paths.
    """
    ...

def parse_chapter_list(
    self, soup: PageSoup
) -> Generator[Union[Chapter, Volume], None, None]:
    """Yield Volume and Chapter objects in order.
    The template appends them to self.volumes and self.chapters.
    """
    ...

def select_chapter_body(self, soup: PageSoup) -> PageSoup:
    """Return the Tag containing the chapter text.
    The template cleans and extracts it.
    Return None if not found.
    """
    ...

Optional methods

Override these in your class when needed. The template provides working defaults for all of them.
| Method | Purpose |
| --- | --- |
| get_novel_soup(self) | Return the BeautifulSoup for the novel page. Override if you need a different URL or a POST request. |
| parse_authors(self, soup) | Yield author name strings. Default: yields nothing. |
| parse_genres(self, soup) | Yield genre or tag strings. Default: yields nothing. |
| parse_summary(self, soup) | Return the novel synopsis string. Default: "". |
| initialize(self) | One-time setup: configure cleaner rules, set custom headers. |
| login(self, username_or_email, password_or_token) | Log in before scraping, for sites that require authentication. |
Example — parse_authors:
def parse_authors(self, soup: PageSoup) -> Generator[str, None, None]:
    tag = soup.find("strong", string="Author:")
    if tag and tag.next_sibling:
        yield tag.next_sibling.get_text(strip=True)
    # For multiple authors:
    # for a in soup.select(".author a"):
    #     yield a.get_text(strip=True)
Example — initialize with custom cleaner rules:
def initialize(self) -> None:
    self.cleaner.bad_css.update(["div.advertisement", "div.social-share"])
    self.cleaner.bad_tags.update(["script", "style"])

Available helper methods

These are inherited from the base template and available inside any method:

HTTP and parsing

| Method | Description |
| --- | --- |
| self.get_soup(url) | GET request, returns PageSoup |
| self.post_soup(url, data) | POST request, returns PageSoup |
| self.get_json(url) | GET request, returns parsed JSON |
| self.post_json(url, data) | POST request, returns parsed JSON |
| self.submit_form(url, data) | Submit form data |

URLs

| Method | Description |
| --- | --- |
| self.absolute_url(path) | Convert a relative path like /chapter/1 to a full URL |
| self.novel_url | The novel page URL the user provided |
| self.home_url | The first value in base_url |

Content cleaning

# Remove specific CSS selectors before extraction
self.cleaner.bad_css.update(["div.ads", "span.watermark"])

# Remove specific tags
self.cleaner.bad_tags.update(["script", "style"])

# Extract clean HTML from a tag
html = self.cleaner.extract_contents(soup_element)

Using ChatGPT to generate a crawler

The CLI includes a command that uses ChatGPT to generate a crawler from a novel URL:
lncrawl sources create
This is a good starting point, but the generated code will need review and testing before it is ready to submit.

Known engine templates

If the site runs on a widely-used novel platform, you may be able to inherit from an existing engine template and only override base_url:
  • lncrawl.templates.madara — Madara WordPress theme
  • lncrawl.templates.novelfull — NovelFull-style sites
  • lncrawl.templates.novelpub — NovelPub-style sites
Check existing crawlers in sources/ that use these templates for reference.
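For instance, a Madara-based site might need nothing more than the following sketch. The class name MadaraTemplate is assumed from the module path above; confirm the exact import against an existing crawler in sources/ before relying on it:

```python
from lncrawl.templates.madara import MadaraTemplate

class MySiteCrawler(MadaraTemplate):
    base_url = ["https://mysite.com/"]
```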

Best practices

  • Handle missing elements — not every novel has a cover or author. Always use if tag: before accessing attributes.
  • Log useful info — logger.info("Found %d chapters", len(self.chapters)) makes debugging much easier.
  • Use self.absolute_url() — for all chapter URLs and the cover image, to ensure they resolve correctly from any context.
  • Test edge cases — try a novel with many chapters, one with special characters in the title, and one with no cover.
  • Clean the chapter content — configure self.cleaner to strip ads, navigation links, and scripts so the exported ebook looks clean.
  • Respect the site — don’t send too many requests at once; the base app already limits concurrency.

Common mistakes

  • Wrong selectors — the site’s HTML may use different class names from what you expect. Always inspect the live page in your browser.
  • Relative URLs — always use self.absolute_url(link["href"]) for chapter URLs and cover images.
  • Unimplemented required methods — parse_title, parse_cover, parse_chapter_list, and select_chapter_body must all return real data, or the app will do nothing.

Complete example

A full working crawler using GeneralSoupTemplate:
# -*- coding: utf-8 -*-
import logging
from typing import Generator, Union

from lncrawl.core import PageSoup
from lncrawl.models import Chapter, Volume
from lncrawl.templates.soup.general import GeneralSoupTemplate

logger = logging.getLogger(__name__)


class ExampleCrawler(GeneralSoupTemplate):
    base_url = ["https://example-novel-site.com/"]

    def initialize(self) -> None:
        self.cleaner.bad_css.update(["div.advertisement", "div.social-share"])

    def parse_title(self, soup: PageSoup) -> str:
        tag = soup.select_one("h1.novel-title")
        return tag.get_text(strip=True) if tag else ""

    def parse_cover(self, soup: PageSoup) -> str:
        img = soup.select_one("img.novel-cover")
        if img and img.get("src"):
            return self.absolute_url(img["src"])
        return ""

    def parse_authors(self, soup: PageSoup) -> Generator[str, None, None]:
        author = soup.select_one("span.author-name")
        if author:
            yield author.get_text(strip=True)

    def parse_chapter_list(
        self, soup: PageSoup
    ) -> Generator[Union[Chapter, Volume], None, None]:
        yield Volume(id=1, title="Volume 1")
        links = soup.select("ul.chapter-list a")
        for idx, a in enumerate(links, 1):
            yield Chapter(
                id=idx,
                title=a.get_text(strip=True),
                url=self.absolute_url(a["href"]),
                volume=1,
            )
        logger.info("Found %d chapters", len(links))

    def select_chapter_body(self, soup: PageSoup) -> PageSoup:
        return soup.select_one("div.chapter-content")
Once everything works, open a pull request to the main repository.
