Overview
Crawlith provides repository classes for structured access to database tables. Each repository encapsulates database queries and provides type-safe methods for common operations.
SiteRepository
Manages website configurations and metadata.
Interface
interface Site {
  id: number;
  domain: string;
  created_at: string;
  settings_json: string | null;
  is_active: number;
}
Constructor
import { getDb, SiteRepository } from '@crawlith/core/db';

const db = getDb();
const siteRepo = new SiteRepository(db);
SQLite database instance from getDb()
getSite
Retrieve a site by domain.
const site = siteRepo.getSite('example.com');
if (site) {
  console.log(`Site ID: ${site.id}`);
}
The domain name to look up
The site record, or undefined if not found
getSiteById
Retrieve a site by its numeric ID.
const site = siteRepo.getSiteById(1);
The site’s numeric identifier
The site record, or undefined if not found
getAllSites
Retrieve all sites, ordered by domain.
const allSites = siteRepo.getAllSites();
allSites.forEach(site => {
  console.log(site.domain);
});
Array of all site records
createSite
Create a new site record.
const siteId = siteRepo.createSite('example.com');
The domain name for the new site
The auto-incremented ID of the newly created site
firstOrCreateSite
Get an existing site or create it if it doesn’t exist.
const site = siteRepo.firstOrCreateSite('example.com');
// Always returns a valid Site object
The domain name to retrieve or create
The existing or newly created site record
deleteSite
Delete a site and all associated data (cascades to snapshots, pages, edges, metrics).
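No usage example is shown for this method, so the sketch below is illustrative: it assumes deleteSite takes the site's numeric ID (a hypothetical signature) and uses a minimal in-memory stand-in repository so the cascade idea is runnable without a database.

```typescript
// Hypothetical sketch: the real SiteRepository is backed by SQLite, and the
// deleteSite parameter is assumed here to be the site's numeric ID.
// A tiny in-memory stand-in makes the cascade behaviour concrete.
interface FakeSite { id: number; domain: string; }

class InMemorySiteRepo {
  private sites = new Map<number, FakeSite>();
  private snapshotsBySite = new Map<number, number[]>();

  createSite(domain: string): number {
    const id = this.sites.size + 1;
    this.sites.set(id, { id, domain });
    this.snapshotsBySite.set(id, []);
    return id;
  }

  // Deleting a site also removes its dependent rows (snapshots here; the
  // real schema cascades to pages, edges, and metrics as well).
  deleteSite(id: number): void {
    this.sites.delete(id);
    this.snapshotsBySite.delete(id);
  }

  getSiteById(id: number): FakeSite | undefined {
    return this.sites.get(id);
  }
}

const repo = new InMemorySiteRepo();
const id = repo.createSite('example.com');
repo.deleteSite(id);
console.log(repo.getSiteById(id)); // undefined: site and dependents are gone
```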
SnapshotRepository
Manages crawl sessions and snapshot metadata.
Interface
interface Snapshot {
  id: number;
  site_id: number;
  type: 'full' | 'partial' | 'incremental';
  created_at: string;
  node_count: number;
  edge_count: number;
  status: 'running' | 'completed' | 'failed';
  limit_reached: number;
  health_score: number | null;
  orphan_count: number | null;
  thin_content_count: number | null;
}
Constructor
import { getDb, SnapshotRepository } from '@crawlith/core/db';

const db = getDb();
const snapshotRepo = new SnapshotRepository(db);
SQLite database instance from getDb()
createSnapshot
Create a new crawl snapshot.
const snapshotId = snapshotRepo.createSnapshot(
  siteId,
  'full',
  'running'
);
The site ID this snapshot belongs to
type ('full' | 'partial' | 'incremental', required): The snapshot type
status ('running' | 'completed' | 'failed', default 'running'): Initial status of the snapshot
The auto-incremented ID of the newly created snapshot
getLatestSnapshot
Retrieve the most recent snapshot for a site.
const snapshot = snapshotRepo.getLatestSnapshot(
  siteId,
  'completed',
  false
);
status ('running' | 'completed' | 'failed', optional): Status filter
Whether to include partial snapshots in the search
The latest snapshot matching the criteria, or undefined
getSnapshot
Retrieve a snapshot by ID.
const snapshot = snapshotRepo.getSnapshot(123);
The snapshot record, or undefined if not found
getSnapshotCount
Get the total number of snapshots for a site.
const count = snapshotRepo.getSnapshotCount(siteId);
The site ID to count snapshots for
Total number of snapshots for the site
updateSnapshotStatus
Update a snapshot’s status and statistics.
snapshotRepo.updateSnapshotStatus(snapshotId, 'completed', {
  node_count: 150,
  edge_count: 450,
  health_score: 0.87,
  orphan_count: 5,
  thin_content_count: 12
});
The snapshot ID to update
status ('completed' | 'failed', required): The new status
stats (Partial<Snapshot>, default {}): Optional statistics to update (node_count, edge_count, limit_reached, health_score, orphan_count, thin_content_count)
deleteSnapshot
Delete a snapshot and clean up orphaned pages.
snapshotRepo.deleteSnapshot(snapshotId);
The snapshot ID to delete
This method runs in a transaction and will:
1. Unlink pages from the snapshot
2. Delete pages no longer referenced by any snapshot
3. Delete the snapshot record
4. Cascade delete edges and metrics via foreign key constraints
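The steps above can be pictured with a plain-TypeScript model (an illustrative in-memory sketch of the cleanup order, not the SQL the repository actually runs):

```typescript
// Illustrative in-memory model of the deleteSnapshot cleanup order.
// The real implementation runs SQL inside a transaction; the data shapes
// here are simplified stand-ins.
type PageRow = { id: number; snapshotIds: Set<number> };

function deleteSnapshotModel(
  snapshots: Set<number>,
  pages: Map<number, PageRow>,
  snapshotId: number
): void {
  // 1. Unlink pages from the snapshot
  for (const page of pages.values()) {
    page.snapshotIds.delete(snapshotId);
  }
  // 2. Delete pages no longer referenced by any snapshot
  for (const [id, page] of pages) {
    if (page.snapshotIds.size === 0) pages.delete(id);
  }
  // 3. Delete the snapshot record (edges and metrics would cascade via
  //    foreign key constraints in SQLite)
  snapshots.delete(snapshotId);
}

const snapshots = new Set([1, 2]);
const pages = new Map<number, PageRow>([
  [10, { id: 10, snapshotIds: new Set([1]) }],    // only in snapshot 1
  [11, { id: 11, snapshotIds: new Set([1, 2]) }]  // shared with snapshot 2
]);
deleteSnapshotModel(snapshots, pages, 1);
console.log([...pages.keys()]); // [11]: page 10 became orphaned and was removed
```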
PageRepository
Manages discovered pages and their metadata.
Interface
interface Page {
  id: number;
  site_id: number;
  normalized_url: string;
  first_seen_snapshot_id: number | null;
  last_seen_snapshot_id: number | null;
  http_status: number | null;
  canonical_url: string | null;
  content_hash: string | null;
  simhash: string | null;
  etag: string | null;
  last_modified: string | null;
  html: string | null;
  soft404_score: number | null;
  noindex: number;
  nofollow: number;
  security_error: string | null;
  retries: number;
  depth: number;
  redirect_chain: string | null;
  bytes_received: number | null;
  crawl_trap_flag: number;
  crawl_trap_risk: number | null;
  trap_type: string | null;
  created_at: string;
  updated_at: string;
}
Constructor
import { getDb, PageRepository } from '@crawlith/core/db';

const db = getDb();
const pageRepo = new PageRepository(db);
SQLite database instance from getDb()
upsertPage
Insert or update a page record.
pageRepo.upsertPage({
  site_id: 1,
  normalized_url: 'https://example.com/page',
  last_seen_snapshot_id: 123,
  http_status: 200,
  content_hash: 'abc123',
  depth: 1
});
Page data to insert or update. Must include site_id, normalized_url, and last_seen_snapshot_id.
On conflict (duplicate site_id + normalized_url), the method updates the existing record while preserving non-null values using COALESCE.
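One way to picture the COALESCE behaviour is a merge in which incoming null values never clobber what is already stored. This is a pure-TypeScript sketch of those semantics under that reading, not the repository's actual SQL:

```typescript
// Sketch of COALESCE-style upsert semantics: incoming null/undefined values
// leave the stored column untouched, while non-null values overwrite it.
type PagePatch = Record<string, string | number | null | undefined>;

function coalesceMerge(existing: PagePatch, incoming: PagePatch): PagePatch {
  const merged: PagePatch = { ...existing };
  for (const [key, value] of Object.entries(incoming)) {
    if (value !== null && value !== undefined) merged[key] = value;
  }
  return merged;
}

const stored = { http_status: 200, content_hash: 'abc123', etag: 'v1' };
const update = { http_status: 304, content_hash: null, last_seen_snapshot_id: 124 };
const result = coalesceMerge(stored, update);
console.log(result);
// { http_status: 304, content_hash: 'abc123', etag: 'v1', last_seen_snapshot_id: 124 }
```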
upsertAndGetId
Upsert a page and return its ID.
const pageId = pageRepo.upsertAndGetId({
  site_id: 1,
  normalized_url: 'https://example.com/page',
  last_seen_snapshot_id: 123,
  http_status: 200
});
Page data to insert or update
The page ID (existing or newly created)
upsertMany
Batch upsert multiple pages in a transaction.
const urlToIdMap = pageRepo.upsertMany([
  { site_id: 1, normalized_url: 'https://example.com/page1', last_seen_snapshot_id: 123 },
  { site_id: 1, normalized_url: 'https://example.com/page2', last_seen_snapshot_id: 123 }
]);

const page1Id = urlToIdMap.get('https://example.com/page1');
pages (Array<Partial<Page>>, required): Array of page records to upsert
Map of normalized URLs to their page IDs
getPage
Retrieve a page by site ID and URL.
const page = pageRepo.getPage(1, 'https://example.com/page');
The page record, or undefined if not found
getPagesByUrls
Retrieve multiple pages by their URLs.
const pages = pageRepo.getPagesByUrls(1, [
  'https://example.com/page1',
  'https://example.com/page2'
]);
Array of matching page records
This method handles large URL arrays by chunking them into batches of 900 to avoid SQLite parameter limits.
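The chunking strategy can be sketched generically. The helper below is hypothetical (the repository's internal name for it is not shown in the source); the batch size of 900 stays safely under SQLite's classic default bound-parameter limit of 999:

```typescript
// Hypothetical chunking helper mirroring the batching described above.
// Each batch of URLs would become its own "WHERE url IN (?, ?, ...)" query,
// keeping the number of bound parameters under SQLite's limit.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// e.g. 2000 URLs become batches of 900, 900, and 200; the per-batch query
// results are then concatenated into a single array.
const urls = Array.from({ length: 2000 }, (_, i) => `https://example.com/p${i}`);
const batches = chunk(urls, 900);
console.log(batches.map(b => b.length)); // [900, 900, 200]
```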
getPagesBySnapshot
Retrieve all pages visible in a snapshot.
const pages = pageRepo.getPagesBySnapshot(snapshotId);
Array of all pages first seen on or before this snapshot
getPagesIdentityBySnapshot
Retrieve page IDs and URLs for a snapshot (lightweight query).
const pageIdentities = pageRepo.getPagesIdentityBySnapshot(snapshotId);
// [{ id: 1, normalized_url: 'https://example.com/page' }, ...]
Returns: Array<{ id: number, normalized_url: string }> of page IDs and URLs
getPagesIteratorBySnapshot
Get an iterator for pages in a snapshot (memory-efficient).
for (const page of pageRepo.getPagesIteratorBySnapshot(snapshotId)) {
  console.log(page.normalized_url);
}
Iterator for pages in the snapshot
getIdByUrl
Get a page ID by site and URL.
const pageId = pageRepo.getIdByUrl(1, 'https://example.com/page');
The page ID, or undefined if not found
EdgeRepository
Manages links between pages in the crawl graph.
Interface
interface Edge {
  id: number;
  snapshot_id: number;
  source_page_id: number;
  target_page_id: number;
  weight: number;
  rel: 'nofollow' | 'sponsored' | 'ugc' | 'internal' | 'external' | 'unknown';
}
Constructor
import { getDb, EdgeRepository } from '@crawlith/core/db';

const db = getDb();
const edgeRepo = new EdgeRepository(db);
SQLite database instance from getDb()
insertEdge
Insert a single edge between two pages.
edgeRepo.insertEdge(
  snapshotId,
  sourcePageId,
  targetPageId,
  1.0,
  'internal'
);
The snapshot this edge belongs to
The page ID where the link originates
The page ID the link points to
Link weight (used in PageRank calculations)
rel (string, default 'internal'): Link relationship, one of 'nofollow', 'sponsored', 'ugc', 'internal', 'external', or 'unknown'
insertEdges
Batch insert multiple edges in a transaction.
edgeRepo.insertEdges([
  { snapshot_id: 123, source_page_id: 1, target_page_id: 2, weight: 1.0, rel: 'internal' },
  { snapshot_id: 123, source_page_id: 1, target_page_id: 3, weight: 1.0, rel: 'internal' }
]);
Array of edge records to insert
getEdgesBySnapshot
Retrieve all edges for a snapshot.
const edges = edgeRepo.getEdgesBySnapshot(snapshotId);
Array of all edges in the snapshot
getEdgesIteratorBySnapshot
Get an iterator for edges in a snapshot (memory-efficient).
for (const edge of edgeRepo.getEdgesIteratorBySnapshot(snapshotId)) {
  console.log(`${edge.source_page_id} -> ${edge.target_page_id}`);
}
Iterator for edges in the snapshot
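As a usage sketch, the iterator can stream edges into an adjacency list without materialising the full edge array first. The helper below is illustrative (not part of the library) and accepts any iterable of edge-shaped records:

```typescript
// Illustrative: build an adjacency list from a stream of edges, holding only
// the resulting map in memory rather than the whole edge array.
interface EdgeLike {
  source_page_id: number;
  target_page_id: number;
}

function buildAdjacency(edges: Iterable<EdgeLike>): Map<number, number[]> {
  const adj = new Map<number, number[]>();
  for (const e of edges) {
    const targets = adj.get(e.source_page_id) ?? [];
    targets.push(e.target_page_id);
    adj.set(e.source_page_id, targets);
  }
  return adj;
}

// In real use: buildAdjacency(edgeRepo.getEdgesIteratorBySnapshot(snapshotId))
const sample: EdgeLike[] = [
  { source_page_id: 1, target_page_id: 2 },
  { source_page_id: 1, target_page_id: 3 }
];
const adj = buildAdjacency(sample);
console.log(adj.get(1)); // [2, 3]
```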
MetricsRepository
Manages computed metrics for pages (PageRank, authority scores, duplicate detection).
Interface
interface DbMetrics {
  snapshot_id: number;
  page_id: number;
  authority_score: number | null;
  hub_score: number | null;
  pagerank: number | null;
  pagerank_score: number | null;
  link_role: 'hub' | 'authority' | 'power' | 'balanced' | 'peripheral' | null;
  crawl_status: string | null;
  word_count: number | null;
  thin_content_score: number | null;
  external_link_ratio: number | null;
  orphan_score: number | null;
  duplicate_cluster_id: string | null;
  duplicate_type: 'exact' | 'near' | 'template_heavy' | 'none' | null;
  is_cluster_primary: number;
}
Constructor
import { getDb, MetricsRepository } from '@crawlith/core/db';

const db = getDb();
const metricsRepo = new MetricsRepository(db);
SQLite database instance from getDb()
insertMetrics
Insert or replace metrics for a page.
metricsRepo.insertMetrics({
  snapshot_id: 123,
  page_id: 1,
  pagerank: 0.85,
  authority_score: 0.92,
  hub_score: 0.45,
  link_role: 'authority',
  crawl_status: 'fetched',
  word_count: 1200,
  orphan_score: 0,
  duplicate_type: 'none',
  is_cluster_primary: 0
});
Metrics record to insert or replace
insertMany
Batch insert metrics in a transaction.
metricsRepo.insertMany([
  { snapshot_id: 123, page_id: 1, pagerank: 0.85, /* ... */ },
  { snapshot_id: 123, page_id: 2, pagerank: 0.72, /* ... */ }
]);
Array of metrics records to insert
getMetrics
Retrieve all metrics for a snapshot.
const metrics = metricsRepo.getMetrics(snapshotId);
Array of all metrics in the snapshot
getMetricsIterator
Get an iterator for metrics in a snapshot (memory-efficient).
for (const metric of metricsRepo.getMetricsIterator(snapshotId)) {
  console.log(`Page ${metric.page_id}: PageRank ${metric.pagerank}`);
}
Returns: IterableIterator<DbMetrics>, an iterator over metrics in the snapshot
getMetricsForPage
Retrieve metrics for a specific page in a snapshot.
const metrics = metricsRepo.getMetricsForPage(snapshotId, pageId);
if (metrics) {
  console.log(`PageRank: ${metrics.pagerank}`);
}
Metrics for the page, or undefined if not found
Example: Complete Workflow
import {
  getDb,
  SiteRepository,
  SnapshotRepository,
  PageRepository,
  EdgeRepository,
  MetricsRepository
} from '@crawlith/core/db';

const db = getDb();

// Initialize repositories
const siteRepo = new SiteRepository(db);
const snapshotRepo = new SnapshotRepository(db);
const pageRepo = new PageRepository(db);
const edgeRepo = new EdgeRepository(db);
const metricsRepo = new MetricsRepository(db);

// Create or get site
const site = siteRepo.firstOrCreateSite('example.com');

// Create snapshot
const snapshotId = snapshotRepo.createSnapshot(site.id, 'full');

// Upsert pages
const urlToId = pageRepo.upsertMany([
  {
    site_id: site.id,
    normalized_url: 'https://example.com/',
    last_seen_snapshot_id: snapshotId,
    http_status: 200,
    depth: 0
  },
  {
    site_id: site.id,
    normalized_url: 'https://example.com/about',
    last_seen_snapshot_id: snapshotId,
    http_status: 200,
    depth: 1
  }
]);

// Insert edges
const homeId = urlToId.get('https://example.com/')!;
const aboutId = urlToId.get('https://example.com/about')!;
edgeRepo.insertEdges([
  {
    snapshot_id: snapshotId,
    source_page_id: homeId,
    target_page_id: aboutId,
    weight: 1.0,
    rel: 'internal'
  }
]);

// Insert metrics
metricsRepo.insertMany([
  {
    snapshot_id: snapshotId,
    page_id: homeId,
    pagerank: 0.85,
    authority_score: 0.92,
    hub_score: 0.45,
    link_role: 'authority',
    crawl_status: 'fetched',
    word_count: 1200,
    orphan_score: 0,
    duplicate_type: 'none',
    is_cluster_primary: 0
  }
]);

// Update snapshot status
snapshotRepo.updateSnapshotStatus(snapshotId, 'completed', {
  node_count: 2,
  edge_count: 1,
  health_score: 0.95
});

// Query results
const pages = pageRepo.getPagesBySnapshot(snapshotId);
const edges = edgeRepo.getEdgesBySnapshot(snapshotId);
const metrics = metricsRepo.getMetrics(snapshotId);
console.log(`Crawled ${pages.length} pages with ${edges.length} links`);
See also: Database Overview, which covers the SQLite schema, database location, and configuration.