
Overview

Crawlith provides repository classes for structured access to database tables. Each repository encapsulates database queries and provides type-safe methods for common operations.

SiteRepository

Manages website configurations and metadata.

Interface

interface Site {
  id: number;
  domain: string;
  created_at: string;
  settings_json: string | null;
  is_active: number;
}

Constructor

import { getDb } from '@crawlith/core/db';
import { SiteRepository } from '@crawlith/core/db';

const db = getDb();
const siteRepo = new SiteRepository(db);
Parameters:
- db (Database, required): SQLite database instance from getDb()

getSite

Retrieve a site by domain.
const site = siteRepo.getSite('example.com');
if (site) {
  console.log(`Site ID: ${site.id}`);
}
Parameters:
- domain (string, required): The domain name to look up

Returns: Site | undefined. The site record, or undefined if not found.

getSiteById

Retrieve a site by its numeric ID.
const site = siteRepo.getSiteById(1);
Parameters:
- id (number, required): The site's numeric identifier

Returns: Site | undefined. The site record, or undefined if not found.

getAllSites

Retrieve all sites, ordered by domain.
const allSites = siteRepo.getAllSites();
allSites.forEach(site => {
  console.log(site.domain);
});
Returns: Site[]. Array of all site records.

createSite

Create a new site record.
const siteId = siteRepo.createSite('example.com');
Parameters:
- domain (string, required): The domain name for the new site

Returns: number. The auto-incremented ID of the newly created site.

firstOrCreateSite

Get an existing site or create it if it doesn’t exist.
const site = siteRepo.firstOrCreateSite('example.com');
// Always returns a valid Site object
Parameters:
- domain (string, required): The domain name to retrieve or create

Returns: Site. The existing or newly created site record.

deleteSite

Delete a site and all associated data (cascades to snapshots, pages, edges, metrics).
siteRepo.deleteSite(1);
Parameters:
- id (number, required): The site ID to delete

SnapshotRepository

Manages crawl sessions and snapshot metadata.

Interface

interface Snapshot {
  id: number;
  site_id: number;
  type: 'full' | 'partial' | 'incremental';
  created_at: string;
  node_count: number;
  edge_count: number;
  status: 'running' | 'completed' | 'failed';
  limit_reached: number;
  health_score: number | null;
  orphan_count: number | null;
  thin_content_count: number | null;
}

Constructor

import { getDb } from '@crawlith/core/db';
import { SnapshotRepository } from '@crawlith/core/db';

const db = getDb();
const snapshotRepo = new SnapshotRepository(db);
Parameters:
- db (Database, required): SQLite database instance from getDb()

createSnapshot

Create a new crawl snapshot.
const snapshotId = snapshotRepo.createSnapshot(
  siteId,
  'full',
  'running'
);
Parameters:
- siteId (number, required): The site ID this snapshot belongs to
- type ('full' | 'partial' | 'incremental', required): The snapshot type
- status ('running' | 'completed' | 'failed', default 'running'): Initial status of the snapshot

Returns: number. The auto-incremented ID of the newly created snapshot.

getLatestSnapshot

Retrieve the most recent snapshot for a site.
const snapshot = snapshotRepo.getLatestSnapshot(
  siteId,
  'completed',
  false
);
Parameters:
- siteId (number, required): The site ID to query
- status ('running' | 'completed' | 'failed', optional): Status filter
- includePartial (boolean, default false): Whether to include partial snapshots in the search

Returns: Snapshot | undefined. The latest snapshot matching the criteria, or undefined.

getSnapshot

Retrieve a snapshot by ID.
const snapshot = snapshotRepo.getSnapshot(123);
Parameters:
- id (number, required): The snapshot ID

Returns: Snapshot | undefined. The snapshot record, or undefined if not found.

getSnapshotCount

Get the total number of snapshots for a site.
const count = snapshotRepo.getSnapshotCount(siteId);
Parameters:
- siteId (number, required): The site ID to count snapshots for

Returns: number. Total number of snapshots for the site.

updateSnapshotStatus

Update a snapshot’s status and statistics.
snapshotRepo.updateSnapshotStatus(snapshotId, 'completed', {
  node_count: 150,
  edge_count: 450,
  health_score: 0.87,
  orphan_count: 5,
  thin_content_count: 12
});
Parameters:
- id (number, required): The snapshot ID to update
- status ('completed' | 'failed', required): The new status
- stats (Partial<Snapshot>, default {}): Optional statistics to update (node_count, edge_count, limit_reached, health_score, orphan_count, thin_content_count)

deleteSnapshot

Delete a snapshot and clean up orphaned pages.
snapshotRepo.deleteSnapshot(snapshotId);
Parameters:
- id (number, required): The snapshot ID to delete
This method runs in a transaction and will:
  1. Unlink pages from the snapshot
  2. Delete pages no longer referenced by any snapshot
  3. Delete the snapshot record
  4. Cascade delete edges and metrics via foreign key constraints
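The orphan-cleanup rule in steps 1 and 2 can be sketched in isolation: a page survives the deletion only if some other snapshot still references it. This is a simplified in-memory illustration of the rule, not the repository's actual SQL; the `PageRef` shape is hypothetical.

```typescript
// Simplified model of deleteSnapshot's orphan cleanup (steps 1-2).
// A page is kept only when at least one other snapshot still
// references it after the deleted snapshot is unlinked.
// (Illustration only; the real repository does this in SQL
// inside a transaction.)
interface PageRef {
  id: number;
  snapshotIds: number[]; // snapshots that reference this page
}

function pagesAfterSnapshotDelete(pages: PageRef[], snapshotId: number): PageRef[] {
  return pages
    // Step 1: unlink pages from the deleted snapshot
    .map(p => ({ ...p, snapshotIds: p.snapshotIds.filter(s => s !== snapshotId) }))
    // Step 2: drop pages no longer referenced by any snapshot
    .filter(p => p.snapshotIds.length > 0);
}

const before: PageRef[] = [
  { id: 1, snapshotIds: [10, 11] }, // also referenced by snapshot 11
  { id: 2, snapshotIds: [10] },     // referenced only by snapshot 10
];
const after = pagesAfterSnapshotDelete(before, 10);
// after contains only page 1, now referencing snapshot 11 alone
```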

PageRepository

Manages discovered pages and their metadata.

Interface

interface Page {
  id: number;
  site_id: number;
  normalized_url: string;
  first_seen_snapshot_id: number | null;
  last_seen_snapshot_id: number | null;
  http_status: number | null;
  canonical_url: string | null;
  content_hash: string | null;
  simhash: string | null;
  etag: string | null;
  last_modified: string | null;
  html: string | null;
  soft404_score: number | null;
  noindex: number;
  nofollow: number;
  security_error: string | null;
  retries: number;
  depth: number;
  redirect_chain: string | null;
  bytes_received: number | null;
  crawl_trap_flag: number;
  crawl_trap_risk: number | null;
  trap_type: string | null;
  created_at: string;
  updated_at: string;
}

Constructor

import { getDb } from '@crawlith/core/db';
import { PageRepository } from '@crawlith/core/db';

const db = getDb();
const pageRepo = new PageRepository(db);
Parameters:
- db (Database, required): SQLite database instance from getDb()

upsertPage

Insert or update a page record.
pageRepo.upsertPage({
  site_id: 1,
  normalized_url: 'https://example.com/page',
  last_seen_snapshot_id: 123,
  http_status: 200,
  content_hash: 'abc123',
  depth: 1
});
Parameters:
- page (Partial<Page>, required): Page data to insert or update. Must include site_id, normalized_url, and last_seen_snapshot_id.
On conflict (duplicate site_id + normalized_url), the method updates the existing record while preserving non-null values using COALESCE.
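The effect of the COALESCE merge can be illustrated with a small standalone helper: for each column, an incoming null falls back to the value already stored. This is a hypothetical sketch of the merge rule on plain objects, not the repository's actual implementation.

```typescript
// Hypothetical illustration of COALESCE-style merging: for each field,
// prefer the incoming value, but fall back to the existing one when the
// incoming value is null or undefined. (Not the repository's code.)
type Row = Record<string, string | number | null>;

function coalesceMerge(existing: Row, incoming: Row): Row {
  const merged: Row = { ...existing };
  for (const [key, value] of Object.entries(incoming)) {
    merged[key] = value ?? existing[key] ?? null;
  }
  return merged;
}

const stored: Row = { http_status: 200, content_hash: 'abc123' };
const update: Row = { http_status: 304, content_hash: null };
const merged = coalesceMerge(stored, update);
// merged.http_status is 304; merged.content_hash keeps 'abc123'
```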

upsertAndGetId

Upsert a page and return its ID.
const pageId = pageRepo.upsertAndGetId({
  site_id: 1,
  normalized_url: 'https://example.com/page',
  last_seen_snapshot_id: 123,
  http_status: 200
});
Parameters:
- page (Partial<Page>, required): Page data to insert or update

Returns: number. The page ID (existing or newly created).

upsertMany

Batch upsert multiple pages in a transaction.
const urlToIdMap = pageRepo.upsertMany([
  { site_id: 1, normalized_url: 'https://example.com/page1', last_seen_snapshot_id: 123 },
  { site_id: 1, normalized_url: 'https://example.com/page2', last_seen_snapshot_id: 123 }
]);

const page1Id = urlToIdMap.get('https://example.com/page1');
Parameters:
- pages (Array<Partial<Page>>, required): Array of page records to upsert

Returns: Map<string, number>. Map of normalized URLs to their page IDs.

getPage

Retrieve a page by site ID and URL.
const page = pageRepo.getPage(1, 'https://example.com/page');
Parameters:
- siteId (number, required): The site ID
- url (string, required): The normalized URL

Returns: Page | undefined. The page record, or undefined if not found.

getPagesByUrls

Retrieve multiple pages by their URLs.
const pages = pageRepo.getPagesByUrls(1, [
  'https://example.com/page1',
  'https://example.com/page2'
]);
Parameters:
- siteId (number, required): The site ID
- urls (string[], required): Array of normalized URLs

Returns: Page[]. Array of matching page records.
This method handles large URL arrays by chunking them into batches of 900 to avoid SQLite parameter limits.
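The chunking itself is a generic pattern; a helper along these lines (a sketch of the pattern, not the repository's internal code) keeps each batch small enough to bind to one `IN (?, ?, ...)` clause without exceeding SQLite's parameter limit.

```typescript
// Split an array into batches of at most `size` elements, so each batch
// can be bound to a single `IN (?, ?, ...)` clause without exceeding
// SQLite's bound-parameter limit. (Sketch; not the repository's code.)
function chunkArray<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// 2000 URLs become three batches: 900 + 900 + 200.
const urls = Array.from({ length: 2000 }, (_, i) => `https://example.com/p${i}`);
const batches = chunkArray(urls, 900);
// batches.length === 3
```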

getPagesBySnapshot

Retrieve all pages visible in a snapshot.
const pages = pageRepo.getPagesBySnapshot(snapshotId);
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: Page[]. Array of all pages first seen on or before this snapshot.

getPagesIdentityBySnapshot

Retrieve page IDs and URLs for a snapshot (lightweight query).
const pageIdentities = pageRepo.getPagesIdentityBySnapshot(snapshotId);
// [{ id: 1, normalized_url: 'https://example.com/page' }, ...]
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: Array<{ id: number, normalized_url: string }>. Array of page IDs and URLs.

getPagesIteratorBySnapshot

Get an iterator for pages in a snapshot (memory-efficient).
for (const page of pageRepo.getPagesIteratorBySnapshot(snapshotId)) {
  console.log(page.normalized_url);
}
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: IterableIterator<Page>. Iterator for pages in the snapshot.

getIdByUrl

Get a page ID by site and URL.
const pageId = pageRepo.getIdByUrl(1, 'https://example.com/page');
Parameters:
- siteId (number, required): The site ID
- url (string, required): The normalized URL

Returns: number | undefined. The page ID, or undefined if not found.

EdgeRepository

Manages links between pages in the crawl graph.

Interface

interface Edge {
  id: number;
  snapshot_id: number;
  source_page_id: number;
  target_page_id: number;
  weight: number;
  rel: 'nofollow' | 'sponsored' | 'ugc' | 'internal' | 'external' | 'unknown';
}

Constructor

import { getDb } from '@crawlith/core/db';
import { EdgeRepository } from '@crawlith/core/db';

const db = getDb();
const edgeRepo = new EdgeRepository(db);
Parameters:
- db (Database, required): SQLite database instance from getDb()

insertEdge

Insert a single edge between two pages.
edgeRepo.insertEdge(
  snapshotId,
  sourcePageId,
  targetPageId,
  1.0,
  'internal'
);
Parameters:
- snapshotId (number, required): The snapshot this edge belongs to
- sourcePageId (number, required): The page ID where the link originates
- targetPageId (number, required): The page ID the link points to
- weight (number, default 1.0): Link weight (used in PageRank calculations)
- rel (string, default 'internal'): Link relationship: 'nofollow', 'sponsored', 'ugc', 'internal', 'external', or 'unknown'

insertEdges

Batch insert multiple edges in a transaction.
edgeRepo.insertEdges([
  { snapshot_id: 123, source_page_id: 1, target_page_id: 2, weight: 1.0, rel: 'internal' },
  { snapshot_id: 123, source_page_id: 1, target_page_id: 3, weight: 1.0, rel: 'internal' }
]);
Parameters:
- edges (Array<Edge>, required): Array of edge records to insert

getEdgesBySnapshot

Retrieve all edges for a snapshot.
const edges = edgeRepo.getEdgesBySnapshot(snapshotId);
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: Edge[]. Array of all edges in the snapshot.

getEdgesIteratorBySnapshot

Get an iterator for edges in a snapshot (memory-efficient).
for (const edge of edgeRepo.getEdgesIteratorBySnapshot(snapshotId)) {
  console.log(`${edge.source_page_id} -> ${edge.target_page_id}`);
}
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: IterableIterator<Edge>. Iterator for edges in the snapshot.

MetricsRepository

Manages computed metrics for pages (PageRank, authority scores, duplicate detection).

Interface

interface DbMetrics {
  snapshot_id: number;
  page_id: number;
  authority_score: number | null;
  hub_score: number | null;
  pagerank: number | null;
  pagerank_score: number | null;
  link_role: 'hub' | 'authority' | 'power' | 'balanced' | 'peripheral' | null;
  crawl_status: string | null;
  word_count: number | null;
  thin_content_score: number | null;
  external_link_ratio: number | null;
  orphan_score: number | null;
  duplicate_cluster_id: string | null;
  duplicate_type: 'exact' | 'near' | 'template_heavy' | 'none' | null;
  is_cluster_primary: number;
}

Constructor

import { getDb } from '@crawlith/core/db';
import { MetricsRepository } from '@crawlith/core/db';

const db = getDb();
const metricsRepo = new MetricsRepository(db);
Parameters:
- db (Database, required): SQLite database instance from getDb()

insertMetrics

Insert or replace metrics for a page.
metricsRepo.insertMetrics({
  snapshot_id: 123,
  page_id: 1,
  pagerank: 0.85,
  authority_score: 0.92,
  hub_score: 0.45,
  link_role: 'authority',
  crawl_status: 'fetched',
  word_count: 1200,
  orphan_score: 0,
  duplicate_type: 'none',
  is_cluster_primary: 0
});
Parameters:
- metrics (DbMetrics, required): Metrics record to insert or replace

insertMany

Batch insert metrics in a transaction.
metricsRepo.insertMany([
  { snapshot_id: 123, page_id: 1, pagerank: 0.85, /* ... */ },
  { snapshot_id: 123, page_id: 2, pagerank: 0.72, /* ... */ }
]);
Parameters:
- metricsList (DbMetrics[], required): Array of metrics records to insert

getMetrics

Retrieve all metrics for a snapshot.
const metrics = metricsRepo.getMetrics(snapshotId);
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: DbMetrics[]. Array of all metrics in the snapshot.

getMetricsIterator

Get an iterator for metrics in a snapshot (memory-efficient).
for (const metric of metricsRepo.getMetricsIterator(snapshotId)) {
  console.log(`Page ${metric.page_id}: PageRank ${metric.pagerank}`);
}
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: IterableIterator<DbMetrics>. Iterator for metrics in the snapshot.

getMetricsForPage

Retrieve metrics for a specific page in a snapshot.
const metrics = metricsRepo.getMetricsForPage(snapshotId, pageId);
if (metrics) {
  console.log(`PageRank: ${metrics.pagerank}`);
}
Parameters:
- snapshotId (number, required): The snapshot ID
- pageId (number, required): The page ID

Returns: DbMetrics | undefined. Metrics for the page, or undefined if not found.

Example: Complete Workflow

import { getDb } from '@crawlith/core/db';
import {
  SiteRepository,
  SnapshotRepository,
  PageRepository,
  EdgeRepository,
  MetricsRepository
} from '@crawlith/core/db';

const db = getDb();

// Initialize repositories
const siteRepo = new SiteRepository(db);
const snapshotRepo = new SnapshotRepository(db);
const pageRepo = new PageRepository(db);
const edgeRepo = new EdgeRepository(db);
const metricsRepo = new MetricsRepository(db);

// Create or get site
const site = siteRepo.firstOrCreateSite('example.com');

// Create snapshot
const snapshotId = snapshotRepo.createSnapshot(site.id, 'full');

// Upsert pages
const urlToId = pageRepo.upsertMany([
  {
    site_id: site.id,
    normalized_url: 'https://example.com/',
    last_seen_snapshot_id: snapshotId,
    http_status: 200,
    depth: 0
  },
  {
    site_id: site.id,
    normalized_url: 'https://example.com/about',
    last_seen_snapshot_id: snapshotId,
    http_status: 200,
    depth: 1
  }
]);

// Insert edges
const homeId = urlToId.get('https://example.com/')!;
const aboutId = urlToId.get('https://example.com/about')!;

edgeRepo.insertEdges([
  {
    snapshot_id: snapshotId,
    source_page_id: homeId,
    target_page_id: aboutId,
    weight: 1.0,
    rel: 'internal'
  }
]);

// Insert metrics
metricsRepo.insertMany([
  {
    snapshot_id: snapshotId,
    page_id: homeId,
    pagerank: 0.85,
    authority_score: 0.92,
    hub_score: 0.45,
    link_role: 'authority',
    crawl_status: 'fetched',
    word_count: 1200,
    orphan_score: 0,
    duplicate_type: 'none',
    is_cluster_primary: 0
  }
]);

// Update snapshot status
snapshotRepo.updateSnapshotStatus(snapshotId, 'completed', {
  node_count: 2,
  edge_count: 1,
  health_score: 0.95
});

// Query results
const pages = pageRepo.getPagesBySnapshot(snapshotId);
const edges = edgeRepo.getEdgesBySnapshot(snapshotId);
const metrics = metricsRepo.getMetrics(snapshotId);

console.log(`Crawled ${pages.length} pages with ${edges.length} links`);

See also: Database Overview, which covers the SQLite schema, database location, and configuration.
