ArcHive’s intelligent parsing system automatically processes URLs to extract rich metadata, ensuring your archived content is well-organized and easy to find.
Parser Architecture
ArcHive uses a multi-tiered parsing system that selects the appropriate parser based on the URL’s domain:
```typescript
export const parseUrl = async (url: string) => {
  let parsedResult = null;

  if (url.includes("github.com")) {
    parsedResult = await githubParser(url);
  } else if (url.includes("instagram.com")) {
    parsedResult = await instagramParser(url);
  } else if (url.includes("youtube.com")) {
    parsedResult = await youtubeParser(url);
  } else if (url.includes("linkedin.com")) {
    parsedResult = await linkedInParser(url);
  } else if (url.includes("twitter.com") || url.includes("x.com")) {
    parsedResult = await xParser(url);
  }

  if (parsedResult) {
    return parsedResult;
  }

  return genericParser(url);
};
```
Source: backend/src/parsers/index.ts:19-39
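The if/else chain grows linearly as platforms are added. The same dispatch can be sketched as a lookup table; this is an illustrative alternative, not the codebase's implementation, and the stand-in parsers and `PARSER_REGISTRY`/`selectParser` names are hypothetical:

```typescript
type Parser = (url: string) => Promise<unknown>;

// Stand-in parsers so the sketch is self-contained:
const githubParser: Parser = async (url) => ({ platform: "github", url });
const xParser: Parser = async (url) => ({ platform: "twitter", url });
const genericParser: Parser = async (url) => ({ platform: "other", url });

// Hypothetical registry mapping domain substrings to parsers:
const PARSER_REGISTRY: Array<[string[], Parser]> = [
  [["github.com"], githubParser],
  [["twitter.com", "x.com"], xParser],
];

// Pick the first parser whose domain list matches, else fall back to generic.
const selectParser = (url: string): Parser => {
  const entry = PARSER_REGISTRY.find(([domains]) =>
    domains.some((d) => url.includes(d)),
  );
  return entry ? entry[1] : genericParser;
};
```

Registering a new platform then becomes a one-line table entry instead of another `else if` branch.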
Specialized Parsers
GitHub Parser
For GitHub repositories, ArcHive uses the GitHub API to fetch rich repository data:
Extract Repository Information

Parse the URL to extract the owner and repository name:

```typescript
const parts = url.split("/");
const owner = parts[3];
const repo = parts[4];
```

Fetch from GitHub API

Query the GitHub API for repository details:

```typescript
const apiUrl = `https://api.github.com/repos/${owner}/${repo}`;
const response = await axios.get(apiUrl);
const repoData = response.data;
```

Return Structured Data

Extract and return the key metadata:

```typescript
return {
  type: "link",
  title: repoData.full_name,
  description: repoData.description,
  url: repoData.html_url,
  platform: PLATFORMS.GITHUB,
};
```
Source: backend/src/parsers/github.parser.ts:4-24
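Indexing `split("/")` at positions 3 and 4 assumes a full `https://github.com/...` URL. A slightly more defensive variant parses the path instead; this is a sketch, and `extractRepoPath` is not part of the codebase:

```typescript
// Hypothetical helper: returns null unless the URL is a github.com repo path.
const extractRepoPath = (
  url: string,
): { owner: string; repo: string } | null => {
  const { hostname, pathname } = new URL(url);
  if (hostname !== "github.com" && hostname !== "www.github.com") return null;
  // Drop empty segments from leading/trailing slashes.
  const [owner, repo] = pathname.split("/").filter(Boolean);
  return owner && repo ? { owner, repo } : null;
};

extractRepoPath("https://github.com/facebook/react");
// → { owner: "facebook", repo: "react" }
```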
YouTube Parser
YouTube videos are parsed using the YouTube Data API v3:
Extract Video ID

Support multiple URL formats (watch, youtu.be, embed, shorts):

```typescript
const extractVideoId = (url: string): string | null => {
  const match = url.match(
    /(?:youtu\.be\/|youtube\.com(?:\/embed\/|\/v\/|\/watch\?v=|\/shorts\/))([-\w]{11})/,
  );
  return match ? match[1] : null;
};
```

Query YouTube API

Fetch video details using the YouTube Data API:

```typescript
const response = await youtube.videos.list({
  part: ["snippet"],
  id: [videoId],
});
```

Extract Thumbnail

Select the highest-quality thumbnail available:

```typescript
const thumbnail =
  snippet.thumbnails?.maxres?.url ||
  snippet.thumbnails?.high?.url ||
  snippet.thumbnails?.medium?.url ||
  snippet.thumbnails?.default?.url ||
  "";
```
Source: backend/src/parsers/youtube.parser.ts:17-85
The YouTube parser requires a YOUTUBE_API_KEY environment variable to be configured.
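The `extractVideoId` helper can be exercised against the supported URL formats (the video ID here is just an example):

```typescript
const extractVideoId = (url: string): string | null => {
  const match = url.match(
    /(?:youtu\.be\/|youtube\.com(?:\/embed\/|\/v\/|\/watch\?v=|\/shorts\/))([-\w]{11})/,
  );
  return match ? match[1] : null;
};

extractVideoId("https://www.youtube.com/watch?v=dQw4w9WgXcQ"); // "dQw4w9WgXcQ"
extractVideoId("https://youtu.be/dQw4w9WgXcQ");                // "dQw4w9WgXcQ"
extractVideoId("https://www.youtube.com/shorts/dQw4w9WgXcQ");  // "dQw4w9WgXcQ"
extractVideoId("https://example.com/watch?v=dQw4w9WgXcQ");     // null (not a YouTube host)
```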
X (Twitter) Parser

X (Twitter) content is parsed using Puppeteer for dynamic content rendering:
```typescript
export const xParser = async (url: string) => {
  const browser = await browserManager.getBrowser();
  const page = await browser.newPage();

  try {
    await page.goto(url, { waitUntil: "networkidle2", timeout: 30000 });

    const metadata = await page.evaluate(() => {
      const getMetaContent = (property: string) =>
        document
          .querySelector(`meta[property="${property}"]`)
          ?.getAttribute("content") || "";

      const title =
        getMetaContent("og:title") ||
        document.querySelector("title")?.textContent ||
        "";
      const description = getMetaContent("og:description") || "";
      const previewImageUrl = getMetaContent("og:image") || "";

      return { title, description, previewImageUrl };
    });

    return {
      type: "link",
      title: metadata.title.trim(),
      description: metadata.description ? metadata.description.trim() : "",
      url: url,
      previewImageUrl: metadata.previewImageUrl || "",
      platform: PLATFORMS.TWITTER,
    };
  } finally {
    await page.close();
  }
};
```
Source: backend/src/parsers/x.parser.ts:14-72
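The fallback chain inside `page.evaluate` is the same pattern for each field. Isolated as a pure function, it can be unit-tested without a browser; this is a sketch, and `buildMetadata`/`MetaLookup` are illustrative names rather than codebase APIs:

```typescript
// A lookup standing in for the DOM query over meta[property="..."] tags.
type MetaLookup = (property: string) => string;

const buildMetadata = (getMeta: MetaLookup, pageTitle: string) => ({
  title: (getMeta("og:title") || pageTitle || "").trim(),
  description: (getMeta("og:description") || "").trim(),
  previewImageUrl: getMeta("og:image") || "",
});

// With Open Graph tags present, they win over the <title> fallback:
const tags: Record<string, string> = {
  "og:title": "  A post on X  ",
  "og:description": "Post details",
};
buildMetadata((p) => tags[p] ?? "", "fallback <title>");
// → { title: "A post on X", description: "Post details", previewImageUrl: "" }
```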
Generic Parser
For all other URLs, the generic parser uses Open Graph and meta tags:
```typescript
export const genericParser = async (url: string, headers?: any) => {
  const response = await axios.get(url, { headers });
  const html = response.data;
  const $ = cheerio.load(html);

  const rawTitle =
    $('meta[property="og:title"]').attr("content") || $("title").text();
  const title = rawTitle ? rawTitle.trim() : "";
  const description =
    $('meta[property="og:description"]').attr("content") ||
    $('meta[name="description"]').attr("content");
  const previewImageUrl = $('meta[property="og:image"]').attr("content");

  return {
    type: "link",
    title: title.trim(),
    description: description ? description.trim() : "",
    url: url,
    previewImageUrl: previewImageUrl || "",
    platform: extractPlatformFromUrl(url),
  };
};
```
Source: backend/src/parsers/generic.parser.ts:5-26
Automatic Screenshot Generation
For every link saved, ArcHive generates a screenshot using Puppeteer:
Screenshot generation happens asynchronously via a BullMQ queue, so it doesn’t slow down content creation.
The screenshot queue is triggered after content creation:
```typescript
screenshotQueue
  .add("screenshot-queue", {
    contentId: newContent._id,
    url: newContent.url,
    userId: userId,
  })
  .catch((err) =>
    console.error("Failed to enqueue screenshot job", {
      contentId: newContent._id,
      error: err,
    }),
  );
```
Source: backend/src/services/content.service.ts:44-55
Screenshots are stored on Cloudinary for fast, reliable access.
Intelligent Tag Generation
ArcHive automatically suggests tags by analyzing the content:
Tag Generation Process
Extract Content

Try to get content from the parsed metadata first:

```typescript
const genericParsed = await parseUrl(url);
if (genericParsed?.description) {
  return extractRelevantTags(genericParsed.description);
}
```

Fallback to Web Scraping

If no description is available, use Puppeteer to scrape the page:

- Check meta keywords
- Extract the Open Graph description
- Parse article content
- Use Mozilla Readability for content extraction

Extract and Stem Tags

Process the text to extract meaningful tags:

- Tokenize the text
- Apply the Porter stemming algorithm
- Filter and rank by relevance
Source: backend/src/utils/generateTagsFromUrl.ts:48-130
Tags are processed using natural language processing (NLP) with the Porter Stemmer algorithm to ensure consistency (e.g., “running” and “run” become the same tag).
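The tokenize → filter → rank step can be sketched as a dependency-free function. This is a simplified illustration: the stopword list is made up for the example, and the Porter stemming applied by the real pipeline is omitted here:

```typescript
// Illustrative stopword list; the real pipeline is larger and also stems tokens.
const STOPWORDS = new Set([
  "the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "with", "on",
]);

const extractRelevantTags = (text: string, limit = 5): string[] => {
  const counts = new Map<string, number>();
  // Tokenize on non-alphanumeric runs, drop short words and stopwords.
  for (const token of text.toLowerCase().split(/[^a-z0-9]+/)) {
    if (token.length < 3 || STOPWORDS.has(token)) continue;
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  // Rank by frequency and keep the top `limit` candidates.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([word]) => word);
};

extractRelevantTags("Parse URLs, parse metadata, and rank the parsed tags");
// "parse" ranks first because it appears twice
```

With stemming added, "parse" and "parsed" would collapse into a single tag, which is exactly why the real implementation stems before ranking.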
Platform Detection

ArcHive automatically categorizes content by platform:
```typescript
export const extractPlatformFromUrl = (url: string): string => {
  const urlObj = new URL(url);
  const hostname = urlObj.hostname.toLowerCase();
  const domain = hostname.replace(/^www\./, "");

  if (domain.includes("github.com")) return PLATFORMS.GITHUB;
  if (domain.includes("youtube.com") || domain.includes("youtu.be"))
    return PLATFORMS.YOUTUBE;
  if (domain.includes("twitter.com") || domain.includes("x.com"))
    return PLATFORMS.TWITTER;
  if (domain.includes("instagram.com")) return PLATFORMS.INSTAGRAM;
  // ... more platforms
  return domain || PLATFORMS.OTHER;
};
```
Source: backend/src/constants/platforms.ts:26-67
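Assuming illustrative `PLATFORMS` values (the real constants live in `platforms.ts` and may differ), the function resolves known domains to a platform key and falls back to the bare domain for everything else:

```typescript
// Illustrative values only; the real constants are defined in platforms.ts.
const PLATFORMS = {
  GITHUB: "github",
  YOUTUBE: "youtube",
  TWITTER: "twitter",
  INSTAGRAM: "instagram",
  OTHER: "other",
} as const;

const extractPlatformFromUrl = (url: string): string => {
  const domain = new URL(url).hostname.toLowerCase().replace(/^www\./, "");
  if (domain.includes("github.com")) return PLATFORMS.GITHUB;
  if (domain.includes("youtube.com") || domain.includes("youtu.be"))
    return PLATFORMS.YOUTUBE;
  if (domain.includes("twitter.com") || domain.includes("x.com"))
    return PLATFORMS.TWITTER;
  if (domain.includes("instagram.com")) return PLATFORMS.INSTAGRAM;
  return domain || PLATFORMS.OTHER;
};

extractPlatformFromUrl("https://www.youtube.com/watch?v=abc");   // "youtube"
extractPlatformFromUrl("https://news.ycombinator.com/item?id=1"); // "news.ycombinator.com"
```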
Next Steps
- Platform Categorization: learn how to browse content by platform
- Search: discover powerful search capabilities