
Overview

Crawlith provides repository classes for structured access to database tables. Each repository encapsulates database queries and provides type-safe methods for common operations.

SiteRepository

Manages website configurations and metadata.

Interface

interface Site {
  id: number;
  domain: string;
  created_at: string;
  settings_json: string | null;
  is_active: number;
}

Constructor

import { getDb } from '@crawlith/core/db';
import { SiteRepository } from '@crawlith/core/db';

const db = getDb();
const siteRepo = new SiteRepository(db);
Parameters:
- db (Database, required): SQLite database instance from getDb()

getSite

Retrieve a site by domain.
const site = siteRepo.getSite('example.com');
if (site) {
  console.log(`Site ID: ${site.id}`);
}
Parameters:
- domain (string, required): The domain name to look up

Returns: Site | undefined. The site record, or undefined if not found.

getSiteById

Retrieve a site by its numeric ID.
const site = siteRepo.getSiteById(1);
Parameters:
- id (number, required): The site's numeric identifier

Returns: Site | undefined. The site record, or undefined if not found.

getAllSites

Retrieve all sites, ordered by domain.
const allSites = siteRepo.getAllSites();
allSites.forEach(site => {
  console.log(site.domain);
});
Returns: Site[]. Array of all site records.

createSite

Create a new site record.
const siteId = siteRepo.createSite('example.com');
Parameters:
- domain (string, required): The domain name for the new site

Returns: number. The auto-incremented ID of the newly created site.

firstOrCreateSite

Get an existing site or create it if it doesn’t exist.
const site = siteRepo.firstOrCreateSite('example.com');
// Always returns a valid Site object
Parameters:
- domain (string, required): The domain name to retrieve or create

Returns: Site. The existing or newly created site record.

deleteSite

Delete a site and all associated data (cascades to snapshots, pages, edges, metrics).
siteRepo.deleteSite(1);
Parameters:
- id (number, required): The site ID to delete

SnapshotRepository

Manages crawl sessions and snapshot metadata.

Interface

interface Snapshot {
  id: number;
  site_id: number;
  type: 'full' | 'partial' | 'incremental';
  created_at: string;
  node_count: number;
  edge_count: number;
  status: 'running' | 'completed' | 'failed';
  limit_reached: number;
  health_score: number | null;
  orphan_count: number | null;
  thin_content_count: number | null;
}

Constructor

import { getDb } from '@crawlith/core/db';
import { SnapshotRepository } from '@crawlith/core/db';

const db = getDb();
const snapshotRepo = new SnapshotRepository(db);
Parameters:
- db (Database, required): SQLite database instance from getDb()

createSnapshot

Create a new crawl snapshot.
const snapshotId = snapshotRepo.createSnapshot(
  siteId,
  'full',
  'running'
);
Parameters:
- siteId (number, required): The site ID this snapshot belongs to
- type ('full' | 'partial' | 'incremental', required): The snapshot type
- status ('running' | 'completed' | 'failed', default 'running'): Initial status of the snapshot

Returns: number. The auto-incremented ID of the newly created snapshot.

getLatestSnapshot

Retrieve the most recent snapshot for a site.
const snapshot = snapshotRepo.getLatestSnapshot(
  siteId,
  'completed',
  false
);
Parameters:
- siteId (number, required): The site ID to query
- status ('running' | 'completed' | 'failed', optional): Status filter
- includePartial (boolean, default false): Whether to include partial snapshots in the search

Returns: Snapshot | undefined. The latest snapshot matching the criteria, or undefined.

getSnapshot

Retrieve a snapshot by ID.
const snapshot = snapshotRepo.getSnapshot(123);
Parameters:
- id (number, required): The snapshot ID

Returns: Snapshot | undefined. The snapshot record, or undefined if not found.

getSnapshotCount

Get the total number of snapshots for a site.
const count = snapshotRepo.getSnapshotCount(siteId);
Parameters:
- siteId (number, required): The site ID to count snapshots for

Returns: number. Total number of snapshots for the site.

updateSnapshotStatus

Update a snapshot’s status and statistics.
snapshotRepo.updateSnapshotStatus(snapshotId, 'completed', {
  node_count: 150,
  edge_count: 450,
  health_score: 0.87,
  orphan_count: 5,
  thin_content_count: 12
});
Parameters:
- id (number, required): The snapshot ID to update
- status ('completed' | 'failed', required): The new status
- stats (Partial<Snapshot>, default {}): Optional statistics to update (node_count, edge_count, limit_reached, health_score, orphan_count, thin_content_count)

deleteSnapshot

Delete a snapshot and clean up orphaned pages.
snapshotRepo.deleteSnapshot(snapshotId);
Parameters:
- id (number, required): The snapshot ID to delete
This method runs in a transaction and will:
  1. Unlink pages from the snapshot
  2. Delete pages no longer referenced by any snapshot
  3. Delete the snapshot record
  4. Cascade delete edges and metrics via foreign key constraints
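The orphan-cleanup rule in steps 1 and 2 can be sketched in isolation: a page survives the deletion only if some other snapshot still references it. This is a simplified in-memory illustration of the rule, not the repository's actual SQL; the `PageRef` shape is hypothetical.

```typescript
// Simplified model of deleteSnapshot's orphan cleanup (steps 1-2).
// A page is kept only when at least one other snapshot still
// references it after the deleted snapshot is unlinked.
// (Illustration only; the real repository does this in SQL
// inside a transaction.)
interface PageRef {
  id: number;
  snapshotIds: number[]; // snapshots that reference this page
}

function pagesAfterSnapshotDelete(pages: PageRef[], snapshotId: number): PageRef[] {
  return pages
    // Step 1: unlink pages from the deleted snapshot
    .map(p => ({ ...p, snapshotIds: p.snapshotIds.filter(s => s !== snapshotId) }))
    // Step 2: drop pages no longer referenced by any snapshot
    .filter(p => p.snapshotIds.length > 0);
}

const before: PageRef[] = [
  { id: 1, snapshotIds: [10, 11] }, // also referenced by snapshot 11
  { id: 2, snapshotIds: [10] },     // referenced only by snapshot 10
];
const after = pagesAfterSnapshotDelete(before, 10);
// after contains only page 1, now referencing snapshot 11 alone
```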

PageRepository

Manages discovered pages and their metadata.

Interface

interface Page {
  id: number;
  site_id: number;
  normalized_url: string;
  first_seen_snapshot_id: number | null;
  last_seen_snapshot_id: number | null;
  http_status: number | null;
  canonical_url: string | null;
  content_hash: string | null;
  simhash: string | null;
  etag: string | null;
  last_modified: string | null;
  html: string | null;
  soft404_score: number | null;
  noindex: number;
  nofollow: number;
  security_error: string | null;
  retries: number;
  depth: number;
  redirect_chain: string | null;
  bytes_received: number | null;
  crawl_trap_flag: number;
  crawl_trap_risk: number | null;
  trap_type: string | null;
  created_at: string;
  updated_at: string;
}

Constructor

import { getDb } from '@crawlith/core/db';
import { PageRepository } from '@crawlith/core/db';

const db = getDb();
const pageRepo = new PageRepository(db);
Parameters:
- db (Database, required): SQLite database instance from getDb()

upsertPage

Insert or update a page record.
pageRepo.upsertPage({
  site_id: 1,
  normalized_url: 'https://example.com/page',
  last_seen_snapshot_id: 123,
  http_status: 200,
  content_hash: 'abc123',
  depth: 1
});
Parameters:
- page (Partial<Page>, required): Page data to insert or update. Must include site_id, normalized_url, and last_seen_snapshot_id.
On conflict (duplicate site_id + normalized_url), the method updates the existing record while preserving non-null values using COALESCE.
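The effect of the COALESCE merge can be illustrated with a small standalone helper: for each column, an incoming null falls back to the value already stored. This is a hypothetical sketch of the merge rule on plain objects, not the repository's actual implementation.

```typescript
// Hypothetical illustration of COALESCE-style merging: for each field,
// prefer the incoming value, but fall back to the existing one when the
// incoming value is null or undefined. (Not the repository's code.)
type Row = Record<string, string | number | null>;

function coalesceMerge(existing: Row, incoming: Row): Row {
  const merged: Row = { ...existing };
  for (const [key, value] of Object.entries(incoming)) {
    merged[key] = value ?? existing[key] ?? null;
  }
  return merged;
}

const stored: Row = { http_status: 200, content_hash: 'abc123' };
const update: Row = { http_status: 304, content_hash: null };
const merged = coalesceMerge(stored, update);
// merged.http_status is 304; merged.content_hash keeps 'abc123'
```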

upsertAndGetId

Upsert a page and return its ID.
const pageId = pageRepo.upsertAndGetId({
  site_id: 1,
  normalized_url: 'https://example.com/page',
  last_seen_snapshot_id: 123,
  http_status: 200
});
Parameters:
- page (Partial<Page>, required): Page data to insert or update

Returns: number. The page ID (existing or newly created).

upsertMany

Batch upsert multiple pages in a transaction.
const urlToIdMap = pageRepo.upsertMany([
  { site_id: 1, normalized_url: 'https://example.com/page1', last_seen_snapshot_id: 123 },
  { site_id: 1, normalized_url: 'https://example.com/page2', last_seen_snapshot_id: 123 }
]);

const page1Id = urlToIdMap.get('https://example.com/page1');
Parameters:
- pages (Array<Partial<Page>>, required): Array of page records to upsert

Returns: Map<string, number>. Map of normalized URLs to their page IDs.

getPage

Retrieve a page by site ID and URL.
const page = pageRepo.getPage(1, 'https://example.com/page');
Parameters:
- siteId (number, required): The site ID
- url (string, required): The normalized URL

Returns: Page | undefined. The page record, or undefined if not found.

getPagesByUrls

Retrieve multiple pages by their URLs.
const pages = pageRepo.getPagesByUrls(1, [
  'https://example.com/page1',
  'https://example.com/page2'
]);
Parameters:
- siteId (number, required): The site ID
- urls (string[], required): Array of normalized URLs

Returns: Page[]. Array of matching page records.
This method handles large URL arrays by chunking them into batches of 900 to avoid SQLite parameter limits.
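The chunking itself is a generic pattern; a helper along these lines (a sketch of the pattern, not the repository's internal code) keeps each batch small enough to bind to one `IN (?, ?, ...)` clause without exceeding SQLite's parameter limit.

```typescript
// Split an array into batches of at most `size` elements, so each batch
// can be bound to a single `IN (?, ?, ...)` clause without exceeding
// SQLite's bound-parameter limit. (Sketch; not the repository's code.)
function chunkArray<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// 2000 URLs become three batches: 900 + 900 + 200.
const urls = Array.from({ length: 2000 }, (_, i) => `https://example.com/p${i}`);
const batches = chunkArray(urls, 900);
// batches.length === 3
```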

getPagesBySnapshot

Retrieve all pages visible in a snapshot.
const pages = pageRepo.getPagesBySnapshot(snapshotId);
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: Page[]. Array of all pages first seen on or before this snapshot.

getPagesIdentityBySnapshot

Retrieve page IDs and URLs for a snapshot (lightweight query).
const pageIdentities = pageRepo.getPagesIdentityBySnapshot(snapshotId);
// [{ id: 1, normalized_url: 'https://example.com/page' }, ...]
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: Array<{ id: number, normalized_url: string }>. Array of page IDs and URLs.

getPagesIteratorBySnapshot

Get an iterator for pages in a snapshot (memory-efficient).
for (const page of pageRepo.getPagesIteratorBySnapshot(snapshotId)) {
  console.log(page.normalized_url);
}
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: IterableIterator<Page>. Iterator for pages in the snapshot.

getIdByUrl

Get a page ID by site and URL.
const pageId = pageRepo.getIdByUrl(1, 'https://example.com/page');
Parameters:
- siteId (number, required): The site ID
- url (string, required): The normalized URL

Returns: number | undefined. The page ID, or undefined if not found.

EdgeRepository

Manages links between pages in the crawl graph.

Interface

interface Edge {
  id: number;
  snapshot_id: number;
  source_page_id: number;
  target_page_id: number;
  weight: number;
  rel: 'nofollow' | 'sponsored' | 'ugc' | 'internal' | 'external' | 'unknown';
}

Constructor

import { getDb } from '@crawlith/core/db';
import { EdgeRepository } from '@crawlith/core/db';

const db = getDb();
const edgeRepo = new EdgeRepository(db);
Parameters:
- db (Database, required): SQLite database instance from getDb()

insertEdge

Insert a single edge between two pages.
edgeRepo.insertEdge(
  snapshotId,
  sourcePageId,
  targetPageId,
  1.0,
  'internal'
);
Parameters:
- snapshotId (number, required): The snapshot this edge belongs to
- sourcePageId (number, required): The page ID where the link originates
- targetPageId (number, required): The page ID the link points to
- weight (number, default 1.0): Link weight (used in PageRank calculations)
- rel (string, default 'internal'): Link relationship: 'nofollow', 'sponsored', 'ugc', 'internal', 'external', or 'unknown'

insertEdges

Batch insert multiple edges in a transaction.
edgeRepo.insertEdges([
  { snapshot_id: 123, source_page_id: 1, target_page_id: 2, weight: 1.0, rel: 'internal' },
  { snapshot_id: 123, source_page_id: 1, target_page_id: 3, weight: 1.0, rel: 'internal' }
]);
Parameters:
- edges (Array<Edge>, required): Array of edge records to insert

getEdgesBySnapshot

Retrieve all edges for a snapshot.
const edges = edgeRepo.getEdgesBySnapshot(snapshotId);
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: Edge[]. Array of all edges in the snapshot.

getEdgesIteratorBySnapshot

Get an iterator for edges in a snapshot (memory-efficient).
for (const edge of edgeRepo.getEdgesIteratorBySnapshot(snapshotId)) {
  console.log(`${edge.source_page_id} -> ${edge.target_page_id}`);
}
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: IterableIterator<Edge>. Iterator for edges in the snapshot.

MetricsRepository

Manages computed metrics for pages (PageRank, authority scores, duplicate detection).

Interface

interface DbMetrics {
  snapshot_id: number;
  page_id: number;
  authority_score: number | null;
  hub_score: number | null;
  pagerank: number | null;
  pagerank_score: number | null;
  link_role: 'hub' | 'authority' | 'power' | 'balanced' | 'peripheral' | null;
  crawl_status: string | null;
  word_count: number | null;
  thin_content_score: number | null;
  external_link_ratio: number | null;
  orphan_score: number | null;
  duplicate_cluster_id: string | null;
  duplicate_type: 'exact' | 'near' | 'template_heavy' | 'none' | null;
  is_cluster_primary: number;
}

Constructor

import { getDb } from '@crawlith/core/db';
import { MetricsRepository } from '@crawlith/core/db';

const db = getDb();
const metricsRepo = new MetricsRepository(db);
Parameters:
- db (Database, required): SQLite database instance from getDb()

insertMetrics

Insert or replace metrics for a page.
metricsRepo.insertMetrics({
  snapshot_id: 123,
  page_id: 1,
  pagerank: 0.85,
  authority_score: 0.92,
  hub_score: 0.45,
  link_role: 'authority',
  crawl_status: 'fetched',
  word_count: 1200,
  orphan_score: 0,
  duplicate_type: 'none',
  is_cluster_primary: 0
});
Parameters:
- metrics (DbMetrics, required): Metrics record to insert or replace

insertMany

Batch insert metrics in a transaction.
metricsRepo.insertMany([
  { snapshot_id: 123, page_id: 1, pagerank: 0.85, /* ... */ },
  { snapshot_id: 123, page_id: 2, pagerank: 0.72, /* ... */ }
]);
Parameters:
- metricsList (DbMetrics[], required): Array of metrics records to insert

getMetrics

Retrieve all metrics for a snapshot.
const metrics = metricsRepo.getMetrics(snapshotId);
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: DbMetrics[]. Array of all metrics in the snapshot.

getMetricsIterator

Get an iterator for metrics in a snapshot (memory-efficient).
for (const metric of metricsRepo.getMetricsIterator(snapshotId)) {
  console.log(`Page ${metric.page_id}: PageRank ${metric.pagerank}`);
}
Parameters:
- snapshotId (number, required): The snapshot ID

Returns: IterableIterator<DbMetrics>. Iterator for metrics in the snapshot.

getMetricsForPage

Retrieve metrics for a specific page in a snapshot.
const metrics = metricsRepo.getMetricsForPage(snapshotId, pageId);
if (metrics) {
  console.log(`PageRank: ${metrics.pagerank}`);
}
Parameters:
- snapshotId (number, required): The snapshot ID
- pageId (number, required): The page ID

Returns: DbMetrics | undefined. Metrics for the page, or undefined if not found.

Example: Complete Workflow

import { getDb } from '@crawlith/core/db';
import {
  SiteRepository,
  SnapshotRepository,
  PageRepository,
  EdgeRepository,
  MetricsRepository
} from '@crawlith/core/db';

const db = getDb();

// Initialize repositories
const siteRepo = new SiteRepository(db);
const snapshotRepo = new SnapshotRepository(db);
const pageRepo = new PageRepository(db);
const edgeRepo = new EdgeRepository(db);
const metricsRepo = new MetricsRepository(db);

// Create or get site
const site = siteRepo.firstOrCreateSite('example.com');

// Create snapshot
const snapshotId = snapshotRepo.createSnapshot(site.id, 'full');

// Upsert pages
const urlToId = pageRepo.upsertMany([
  {
    site_id: site.id,
    normalized_url: 'https://example.com/',
    last_seen_snapshot_id: snapshotId,
    http_status: 200,
    depth: 0
  },
  {
    site_id: site.id,
    normalized_url: 'https://example.com/about',
    last_seen_snapshot_id: snapshotId,
    http_status: 200,
    depth: 1
  }
]);

// Insert edges
const homeId = urlToId.get('https://example.com/')!;
const aboutId = urlToId.get('https://example.com/about')!;

edgeRepo.insertEdges([
  {
    snapshot_id: snapshotId,
    source_page_id: homeId,
    target_page_id: aboutId,
    weight: 1.0,
    rel: 'internal'
  }
]);

// Insert metrics
metricsRepo.insertMany([
  {
    snapshot_id: snapshotId,
    page_id: homeId,
    pagerank: 0.85,
    authority_score: 0.92,
    hub_score: 0.45,
    link_role: 'authority',
    crawl_status: 'fetched',
    word_count: 1200,
    orphan_score: 0,
    duplicate_type: 'none',
    is_cluster_primary: 0
  }
]);

// Update snapshot status
snapshotRepo.updateSnapshotStatus(snapshotId, 'completed', {
  node_count: 2,
  edge_count: 1,
  health_score: 0.95
});

// Query results
const pages = pageRepo.getPagesBySnapshot(snapshotId);
const edges = edgeRepo.getEdgesBySnapshot(snapshotId);
const metrics = metricsRepo.getMetrics(snapshotId);

console.log(`Crawled ${pages.length} pages with ${edges.length} links`);

See also: Database Overview, which covers the SQLite schema, database location, and configuration.
