- What’s on the novel page? — title, author, cover, list of chapters.
- What’s inside each chapter? — the story text, cleaned of ads and navigation.
Prerequisites
- Python 3.9+
- The project set up locally (see Development setup)
- Basic familiarity with HTML and CSS selectors — you’ll use selectors like `div.chapter-content` to target elements
Architecture overview
Crawlers live in the `sources/` directory, organized by language. Each crawler is a single `.py` file containing one class that inherits from a template base class. The template handles HTTP requests, concurrency, and output building — you only implement the methods that extract data from the page HTML.
The crawler registry auto-discovers all Python files in sources/ on startup. A crawler is matched to a URL when the URL starts with one of the values in the class’s base_url list.
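The matching rule described above can be sketched in plain Python (the class name and URLs here are invented for illustration):

```python
# Sketch of the registry's URL-matching rule; names are illustrative.
class MyNovelSiteCrawler:
    base_url = [
        "https://mynovelsite.example/",
        "https://www.mynovelsite.example/",
    ]


def find_crawler(url, crawlers):
    """Return the first crawler class whose base_url prefixes the URL."""
    for cls in crawlers:
        if any(url.startswith(prefix) for prefix in cls.base_url):
            return cls
    return None


print(find_crawler("https://mynovelsite.example/novel/42",
                   [MyNovelSiteCrawler]))
```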
Choose a template
Pick the template that matches your site’s structure. Start with `GeneralSoupTemplate` for most sites and only move to a more specific template if you need its extra features.
- GeneralSoupTemplate (recommended)
- SearchableSoupTemplate
- With volumes
- Browser-based
- Base Crawler
Use this template when the site renders HTML server-side and has a straightforward chapter list. It is the default choice for most sites: it requires four methods and handles everything else. File: `sources/_examples/_01_general_soup.py`

File placement
Crawlers are grouped by the site’s language. Place your file in the matching directory:

| Site language | Folder | Example path |
|---|---|---|
| English | sources/en/ then by first letter | sources/en/m/mysite.py |
| Chinese | sources/zh/ | sources/zh/mysite.py |
| Japanese | sources/ja/ | sources/ja/mysite.py |
| Multiple languages | sources/multi/ | sources/multi/mysite.py |
For English sites, include the letter subfolder (for example, `sources/en/m/`). Name the file after the site, such as `mynovelsite.py`. Avoid generic names like `crawler.py`.
Step-by-step: build a crawler
Copy the example file
Copy the appropriate example into the right `sources/{lang}/` folder. For English sites, use the letter subfolder matching your domain’s first letter.

Set base_url and rename the class
Open your new file and update the class name and `base_url`. `base_url` is a list of URL prefixes: when a user pastes a novel URL, the app finds your crawler by checking whether the URL starts with one of these values.

Implement parse_title
Open the novel page in your browser, right-click the title, and select Inspect. Find the CSS selector that uniquely identifies the title element. The `soup` parameter is the parsed HTML of the novel’s detail page.

Implement parse_cover
Find the cover image element and return its URL. Use `self.absolute_url()` to handle relative paths. Return `None` if the site has no cover image.

Implement parse_chapter_list
Yield `Volume` and `Chapter` objects in order; the template appends them to `self.volumes` and `self.chapters`. A chapter list may be flat (no explicit volumes) or grouped under volume headings. Always use `self.absolute_url()` on chapter URLs — some sites use relative paths.

Implement select_chapter_body
The template fetches each chapter page and calls this method with its parsed HTML. Return the single tag that wraps the story text, or `None` if not found. The template extracts and cleans the content automatically.

Test with a real novel URL
Run a quick download test from the project root. If you see errors about selectors or “element not found”, your CSS selectors don’t match the site — use the browser’s Inspect tool to find the correct class names.
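A hypothetical invocation might look like the following. Only the `--first 3` and `-f json` flags come from this guide; the entry point and the way the URL is passed are assumptions, so verify the exact syntax with the CLI’s `--help`:

```shell
# Hypothetical; verify the entry point and flag spellings with --help.
lncrawl -s "https://mynovelsite.example/novel/42" --first 3 -f json
```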
`--first 3` downloads only the first 3 chapters; `-f json` outputs as JSON.

Required method signatures
These are the four methods every `GeneralSoupTemplate` crawler must implement:
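As a sketch of the shape (the real base class, argument types, and return types come from the project; this stub only illustrates the signatures):

```python
# Sketch of the four required method signatures; the real class
# inherits GeneralSoupTemplate and receives BeautifulSoup objects.
class MyNovelSiteCrawler:
    def parse_title(self, soup) -> str: ...        # the novel title
    def parse_cover(self, soup): ...               # cover image URL or None
    def parse_chapter_list(self, soup): ...        # yields Volume/Chapter objects
    def select_chapter_body(self, soup): ...       # the tag wrapping the story text
```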
Optional methods
Override these in your class when needed. The template provides working defaults for all of them.

| Method | Purpose |
|---|---|
| get_novel_soup(self) | Return the BeautifulSoup for the novel page. Override if you need a different URL or POST request. |
| parse_authors(self, soup) | Yield author name strings. Default: yields nothing. |
| parse_genres(self, soup) | Yield genre or tag strings. Default: yields nothing. |
| parse_summary(self, soup) | Return the novel synopsis string. Default: "". |
| initialize(self) | One-time setup — configure cleaner rules, set custom headers. |
| login(self, username_or_email, password_or_token) | Log in before scraping, for sites that require authentication. |
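The optional parsers follow the same pattern as the required ones. As a hedged sketch (the selectors are assumptions, and the `FakeSoup`/`FakeTag` stubs stand in for BeautifulSoup so the example is self-contained):

```python
# Sketch with assumed selectors; FakeSoup/FakeTag stand in for
# BeautifulSoup so the example runs standalone.
class FakeTag:
    def __init__(self, text):
        self.text = text


class FakeSoup:
    def __init__(self, data):
        self.data = data  # selector -> list of FakeTag

    def select(self, sel):
        return self.data.get(sel, [])

    def select_one(self, sel):
        items = self.data.get(sel, [])
        return items[0] if items else None


def parse_authors(soup):
    for a in soup.select("a.author"):          # assumed selector
        yield a.text.strip()


def parse_summary(soup):
    tag = soup.select_one("div.summary")       # assumed selector
    return tag.text.strip() if tag else ""


soup = FakeSoup({"a.author": [FakeTag(" Jane Doe ")],
                 "div.summary": [FakeTag("A story.")]})
print(list(parse_authors(soup)))
print(parse_summary(soup))
```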
Available helper methods
These are inherited from the base template and available inside any method.

HTTP and parsing
| Method | Description |
|---|---|
| self.get_soup(url) | GET request, returns PageSoup |
| self.post_soup(url, data) | POST request, returns PageSoup |
| self.get_json(url) | GET request, returns parsed JSON |
| self.post_json(url, data) | POST request, returns parsed JSON |
| self.submit_form(url, data) | Submit form data |
URLs
| Method | Description |
|---|---|
| self.absolute_url(path) | Convert a relative path like /chapter/1 to a full URL |
| self.novel_url | The novel page URL the user provided |
| self.home_url | The first value in base_url |
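For illustration, `self.absolute_url(path)` behaves roughly like `urllib.parse.urljoin` against the site’s home URL (a simplification; the real helper may also take the current page URL into account):

```python
from urllib.parse import urljoin

# Rough stand-in for self.absolute_url(); the home URL is invented.
home_url = "https://mynovelsite.example/"   # first value in base_url


def absolute_url(path):
    return urljoin(home_url, path)


print(absolute_url("/chapter/1"))                     # full URL
print(absolute_url("https://cdn.example/cover.jpg"))  # already absolute: unchanged
```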
Content cleaning
Using ChatGPT to generate a crawler
The CLI includes a command that uses ChatGPT to generate a crawler from a novel URL.

Known engine templates
If the site runs on a widely-used novel platform, you may be able to inherit from an existing engine template and only override `base_url`:
- lncrawl.templates.madara — Madara WordPress theme
- lncrawl.templates.novelfull — NovelFull-style sites
- lncrawl.templates.novelpub — NovelPub-style sites
There are existing crawlers under `sources/` that use these templates; browse them for reference.
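As a sketch, an engine-template crawler can be very small. The import path comes from the list above, but the `MadaraTemplate` class name and site URL are assumptions; the guarded import keeps the sketch self-contained:

```python
# Sketch: an engine-template crawler often needs only base_url.
try:
    from lncrawl.templates.madara import MadaraTemplate  # class name assumed
except ImportError:
    class MadaraTemplate:      # stub so this sketch runs standalone
        pass


class MyMadaraSiteCrawler(MadaraTemplate):
    base_url = ["https://mymadarasite.example/"]
```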
Best practices
- Handle missing elements — not every novel has a cover or author. Always use `if tag:` before accessing attributes.
- Log useful info — `logger.info("Found %d chapters", len(self.chapters))` makes debugging much easier.
- Use `self.absolute_url()` — for all chapter URLs and the cover image, to ensure they resolve correctly from any context.
- Test edge cases — try a novel with many chapters, one with special characters in the title, and one with no cover.
- Clean the chapter content — configure `self.cleaner` to strip ads, navigation links, and scripts so the exported ebook looks clean.
- Respect the site — don’t send too many requests at once; the base app already limits concurrency.
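The cleaner configuration mentioned above might look like this in practice. The `bad_css` and `bad_tags` attribute names are assumptions about `self.cleaner` (check the template source for the real API); a stub stands in so the sketch runs standalone:

```python
# Hypothetical cleaner configuration, typically done in initialize().
# Attribute names are assumptions; FakeCleaner stands in for self.cleaner.
class FakeCleaner:
    def __init__(self):
        self.bad_css = set()    # CSS selectors stripped from chapter HTML
        self.bad_tags = set()   # tag names stripped entirely


cleaner = FakeCleaner()
cleaner.bad_css.update([".ads", "a.nav-link"])   # ad containers, nav links
cleaner.bad_tags.update(["script", "iframe"])    # never keep these
```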
Common mistakes
Complete example
A full working crawler using `GeneralSoupTemplate`:
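Below is a self-contained sketch of the same shape. The `FakeSoup`/`FakeTag` stubs and every selector are invented for illustration; a real crawler inherits `GeneralSoupTemplate`, receives BeautifulSoup objects, and yields `Chapter` objects rather than dicts:

```python
from urllib.parse import urljoin


class FakeTag:
    """Minimal stand-in for a BeautifulSoup tag."""
    def __init__(self, text="", attrs=None):
        self.text = text
        self.attrs = attrs or {}

    def __getitem__(self, key):
        return self.attrs[key]


class FakeSoup:
    """Minimal stand-in for BeautifulSoup: selector -> list of tags."""
    def __init__(self, data):
        self.data = data

    def select(self, sel):
        return self.data.get(sel, [])

    def select_one(self, sel):
        items = self.data.get(sel, [])
        return items[0] if items else None


class MyNovelSiteCrawler:
    # In the real project this class inherits GeneralSoupTemplate.
    base_url = ["https://mynovelsite.example/"]

    def absolute_url(self, path):
        return urljoin(self.base_url[0], path)

    def parse_title(self, soup):
        tag = soup.select_one("h1.novel-title")        # assumed selector
        return tag.text.strip() if tag else ""

    def parse_cover(self, soup):
        tag = soup.select_one(".book-cover img")       # assumed selector
        return self.absolute_url(tag["src"]) if tag else None

    def parse_chapter_list(self, soup):
        # The real template yields Chapter objects; dicts keep this simple.
        for a in soup.select("ul.chapter-list a"):     # assumed selector
            yield {"title": a.text.strip(),
                   "url": self.absolute_url(a["href"])}

    def select_chapter_body(self, soup):
        return soup.select_one("div.chapter-content")  # assumed selector


# Exercise the crawler against a fake novel page.
crawler = MyNovelSiteCrawler()
page = FakeSoup({
    "h1.novel-title": [FakeTag(" My Novel ")],
    ".book-cover img": [FakeTag(attrs={"src": "/covers/1.jpg"})],
    "ul.chapter-list a": [FakeTag("Chapter 1", {"href": "/c/1"}),
                          FakeTag("Chapter 2", {"href": "/c/2"})],
})
print(crawler.parse_title(page))
print(crawler.parse_cover(page))
print([c["url"] for c in crawler.parse_chapter_list(page)])
```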