
Overview

Crawlith uses SQLite for persistent storage of crawl data, site configurations, and computed metrics. The database provides a robust foundation for tracking website snapshots over time, analyzing link structures, and identifying content issues.

Database Location

By default, Crawlith stores its database at:
~/.crawlith/crawlith.db
You can customize the database location using the CRAWLITH_DB_PATH environment variable:
export CRAWLITH_DB_PATH=/path/to/custom/crawlith.db
For testing, set NODE_ENV=test and Crawlith will use an in-memory database instead.

Database Configuration

The database is initialized with the following SQLite pragmas for optimal performance and data integrity:
  • journal_mode = WAL: Write-Ahead Logging mode for better concurrency and performance
  • synchronous = NORMAL: balanced durability and performance
  • foreign_keys = ON: enforces referential integrity between tables
  • temp_store = MEMORY: stores temporary tables and indices in memory
  • mmap_size = 30000000000: memory-mapped I/O for faster reads (30 GB)
  • cache_size = -20000: page cache size (negative values are in KiB, so roughly 20 MB)
  • busy_timeout = 5000: wait up to 5 seconds when the database is locked
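Taken together, this configuration is equivalent to executing the following statements when the connection is opened (a sketch of the effect; the actual initialization code may apply them differently):

```sql
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA foreign_keys = ON;
PRAGMA temp_store = MEMORY;
PRAGMA mmap_size = 30000000000;
PRAGMA cache_size = -20000;
PRAGMA busy_timeout = 5000;
```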

Schema Overview

The database consists of seven primary tables:

Sites Table

Stores website configurations and metadata.
CREATE TABLE sites (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  domain TEXT UNIQUE NOT NULL,
  created_at TEXT DEFAULT (datetime('now')),
  settings_json TEXT,
  is_active INTEGER DEFAULT 1
);

Snapshots Table

Tracks individual crawl sessions for each site.
CREATE TABLE snapshots (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  site_id INTEGER NOT NULL,
  type TEXT CHECK(type IN ('full', 'partial', 'incremental')) NOT NULL,
  created_at TEXT DEFAULT (datetime('now')),
  node_count INTEGER DEFAULT 0,
  edge_count INTEGER DEFAULT 0,
  status TEXT CHECK(status IN ('running', 'completed', 'failed')) DEFAULT 'running',
  limit_reached INTEGER DEFAULT 0,
  health_score REAL,
  orphan_count INTEGER,
  thin_content_count INTEGER,
  FOREIGN KEY(site_id) REFERENCES sites(id) ON DELETE CASCADE
);
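The status and created_at columns make it straightforward to find the latest usable crawl for a site. For example, an illustrative query (not part of the documented API) to fetch the most recent completed snapshot:

```sql
SELECT id, created_at, node_count, edge_count, health_score
FROM snapshots
WHERE site_id = ? AND status = 'completed'
ORDER BY created_at DESC
LIMIT 1;
```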

Pages Table

Stores discovered pages with metadata, content hashes, and crawl information.
CREATE TABLE pages (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  site_id INTEGER NOT NULL,
  normalized_url TEXT NOT NULL,
  first_seen_snapshot_id INTEGER,
  last_seen_snapshot_id INTEGER,
  http_status INTEGER,
  canonical_url TEXT,
  content_hash TEXT,
  simhash TEXT,
  etag TEXT,
  last_modified TEXT,
  html TEXT,
  soft404_score REAL,
  noindex INTEGER DEFAULT 0,
  nofollow INTEGER DEFAULT 0,
  security_error TEXT,
  retries INTEGER DEFAULT 0,
  depth INTEGER DEFAULT 0,
  redirect_chain TEXT,
  bytes_received INTEGER,
  crawl_trap_flag INTEGER DEFAULT 0,
  crawl_trap_risk REAL,
  trap_type TEXT,
  created_at TEXT DEFAULT (datetime('now')),
  updated_at TEXT DEFAULT (datetime('now')),
  FOREIGN KEY(site_id) REFERENCES sites(id) ON DELETE CASCADE,
  FOREIGN KEY(first_seen_snapshot_id) REFERENCES snapshots(id),
  FOREIGN KEY(last_seen_snapshot_id) REFERENCES snapshots(id),
  UNIQUE(site_id, normalized_url)
);
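Because of the UNIQUE(site_id, normalized_url) constraint, a re-crawled page can be upserted rather than inserted twice. A sketch using SQLite's ON CONFLICT clause (the column subset is chosen for illustration; the actual crawler may update more fields):

```sql
INSERT INTO pages (site_id, normalized_url, http_status, content_hash, last_seen_snapshot_id)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(site_id, normalized_url) DO UPDATE SET
  http_status = excluded.http_status,
  content_hash = excluded.content_hash,
  last_seen_snapshot_id = excluded.last_seen_snapshot_id,
  updated_at = datetime('now');
```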

Edges Table

Represents links between pages in the crawl graph.
CREATE TABLE edges (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  snapshot_id INTEGER NOT NULL,
  source_page_id INTEGER NOT NULL,
  target_page_id INTEGER NOT NULL,
  weight REAL DEFAULT 1.0,
  rel TEXT CHECK(rel IN ('nofollow', 'sponsored', 'ugc', 'internal', 'external', 'unknown')) DEFAULT 'internal',
  FOREIGN KEY(snapshot_id) REFERENCES snapshots(id) ON DELETE CASCADE,
  FOREIGN KEY(source_page_id) REFERENCES pages(id) ON DELETE CASCADE,
  FOREIGN KEY(target_page_id) REFERENCES pages(id) ON DELETE CASCADE
);
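The edge list makes common link-graph questions a single query away. For instance, counting inbound links per page within one snapshot (an illustrative query, not part of the documented API):

```sql
SELECT p.normalized_url, COUNT(*) AS inbound_links
FROM edges e
JOIN pages p ON p.id = e.target_page_id
WHERE e.snapshot_id = ?
GROUP BY e.target_page_id
ORDER BY inbound_links DESC;
```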

Metrics Table

Stores computed metrics for each page in a snapshot (PageRank, authority scores, etc.).
CREATE TABLE metrics (
  snapshot_id INTEGER NOT NULL,
  page_id INTEGER NOT NULL,
  authority_score REAL,
  hub_score REAL,
  pagerank REAL,
  pagerank_score REAL,
  link_role TEXT CHECK(link_role IN ('hub', 'authority', 'power', 'balanced', 'peripheral')),
  crawl_status TEXT,
  word_count INTEGER,
  thin_content_score REAL,
  external_link_ratio REAL,
  orphan_score INTEGER,
  duplicate_cluster_id TEXT,
  duplicate_type TEXT CHECK(duplicate_type IN ('exact', 'near', 'template_heavy', 'none')),
  is_cluster_primary INTEGER DEFAULT 0,
  PRIMARY KEY(snapshot_id, page_id),
  FOREIGN KEY(snapshot_id) REFERENCES snapshots(id) ON DELETE CASCADE,
  FOREIGN KEY(page_id) REFERENCES pages(id) ON DELETE CASCADE
);
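Since metrics are keyed by (snapshot_id, page_id), ranking pages within a snapshot is a join against the pages table. An illustrative query listing the highest-PageRank pages:

```sql
SELECT p.normalized_url, m.pagerank, m.link_role
FROM metrics m
JOIN pages p ON p.id = m.page_id
WHERE m.snapshot_id = ?
ORDER BY m.pagerank DESC
LIMIT 20;
```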

Duplicate Clusters Table

Tracks groups of duplicate or near-duplicate content.
CREATE TABLE duplicate_clusters (
  id TEXT NOT NULL,
  snapshot_id INTEGER NOT NULL,
  type TEXT CHECK(type IN ('exact', 'near', 'template_heavy')) NOT NULL,
  size INTEGER NOT NULL,
  representative TEXT NOT NULL,
  severity TEXT CHECK(severity IN ('low', 'medium', 'high')) NOT NULL,
  PRIMARY KEY(snapshot_id, id),
  FOREIGN KEY(snapshot_id) REFERENCES snapshots(id) ON DELETE CASCADE
);
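Cluster membership is recorded on the metrics table via duplicate_cluster_id, so listing the pages in a given cluster is a join (illustrative; the primary page sorts first):

```sql
SELECT p.normalized_url, m.duplicate_type, m.is_cluster_primary
FROM metrics m
JOIN pages p ON p.id = m.page_id
WHERE m.snapshot_id = ? AND m.duplicate_cluster_id = ?
ORDER BY m.is_cluster_primary DESC;
```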

Content Clusters Table

Groups pages with similar content patterns.
CREATE TABLE content_clusters (
  id INTEGER NOT NULL,
  snapshot_id INTEGER NOT NULL,
  count INTEGER NOT NULL,
  primary_url TEXT NOT NULL,
  risk TEXT CHECK(risk IN ('low', 'medium', 'high')) NOT NULL,
  shared_path_prefix TEXT,
  PRIMARY KEY(snapshot_id, id),
  FOREIGN KEY(snapshot_id) REFERENCES snapshots(id) ON DELETE CASCADE
);

Accessing the Database

Use the getDb() function to access the database instance:
import { getDb } from '@crawlith/core/db';

const db = getDb();
getDb(): Database
Returns the singleton database instance, initializing it if necessary.

Database Functions

getDbPath

Returns the path to the database file.
import { getDbPath } from '@crawlith/core/db';

const path = getDbPath();
// Returns: ~/.crawlith/crawlith.db (or custom path)
getDbPath(): string
Returns the absolute path to the database file.

closeDb

Closes the database connection.
import { closeDb } from '@crawlith/core/db';

closeDb();

Security Features

  • File Permissions: Database directory is created with 0o700 (user-only access)
  • Database File: Set to 0o600 (user read/write only)
  • Integrity Checks: Automatic integrity check on database initialization
  • Foreign Keys: Enabled to maintain referential integrity

Examples

Initialize Database

import { getDb, initSchema } from '@crawlith/core/db';

const db = getDb();
initSchema(db);

Query Database Directly

import { getDb } from '@crawlith/core/db';

const db = getDb();
const sites = db.prepare('SELECT * FROM sites WHERE is_active = 1').all();
console.log(sites);

Use Repositories

For structured access to database tables, use the repository classes:
import { getDb } from '@crawlith/core/db';
import { SiteRepository, SnapshotRepository } from '@crawlith/core/db';

const db = getDb();
const siteRepo = new SiteRepository(db);
const snapshotRepo = new SnapshotRepository(db);

const site = siteRepo.firstOrCreateSite('example.com');
const snapshotId = snapshotRepo.createSnapshot(site.id, 'full');

Repositories

Explore SiteRepository, SnapshotRepository, PageRepository, EdgeRepository, and MetricsRepository
