Social Analyzer can extract structured metadata and patterns from detected profiles to gather intelligence about the profile owner. This information is crucial for OSINT investigations and profile correlation.
Overview
The extraction module (extraction.js) provides two main capabilities:
- Metadata Extraction: Extracts meta tags from profile HTML
- Pattern Extraction: Finds specific patterns like emails, phones, and links
Both features are optional and are enabled with the --metadata and --extract command-line flags.
Metadata Extraction
Metadata extraction parses HTML meta tags to collect information about profiles, including social graph data, descriptions, images, and other structured information.
How It Works
From extraction.js:6-58, the metadata extractor:
async function extract_metadata (site, source) {
  const $ = cheerio.load(source)
  const meta = $('meta')
  const temp_metadata_list = []
  const temp_metadata_for_checking = []
  Object.keys(meta).forEach(function (key) {
    if (meta[key].attribs) {
      if (!strings_meta.test(JSON.stringify(meta[key].attribs))) {
        const temp_dict = {}
        if (meta[key].attribs.property) {
          temp_dict.property = meta[key].attribs.property
        }
        if (meta[key].attribs.name) {
          temp_dict.name = meta[key].attribs.name
        }
        if (meta[key].attribs.itemprop) {
          temp_dict.itemprop = meta[key].attribs.itemprop
        }
        if (meta[key].attribs.content) {
          temp_dict.content = meta[key].attribs.content
        }
        temp_metadata_list.push(temp_dict)
      }
    }
  })
  return temp_metadata_list
}
The extractor filters out technical meta tags using regex patterns:
const strings_meta = new RegExp(
  'regionsAllowed|width|height|color|rgba\\(|charset|viewport|refresh|equiv',
  'i'
)
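Any meta tag whose serialized attributes match this pattern (case-insensitively) is skipped, so layout and browser directives such as viewport, charset, and http-equiv never reach the output. Because the test runs against attribute names and values alike, a tag whose content happens to contain one of these words is skipped as well.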
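To see the filter in action, the following sketch runs the same pattern against a few hand-written attribute objects. The sample data is invented for illustration and is not taken from a real profile. The last sample is dropped only because its content contains the word "color", which shows that the filter looks at values as well as attribute names.

```javascript
// Same pattern as in the excerpt; the sample attribute objects are invented.
const strings_meta = new RegExp(
  'regionsAllowed|width|height|color|rgba\\(|charset|viewport|refresh|equiv',
  'i'
)

const sampleAttribs = [
  { name: 'viewport', content: 'width=device-width, initial-scale=1' },
  { charset: 'utf-8' },
  { 'http-equiv': 'refresh', content: '30' },
  { property: 'og:title', content: 'John Doe' },
  { name: 'description', content: 'I love colorful sunsets' }
]

// A tag is kept only when its serialized attributes do not match the filter.
const kept = sampleAttribs.filter(
  (attribs) => !strings_meta.test(JSON.stringify(attribs))
)

console.log(kept) // only the og:title entry survives
```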
The extractor handles three types of meta tags:
- Property-based (Open Graph, Facebook)
  <meta property="og:title" content="John Doe">
  <meta property="og:image" content="https://...">
- Name-based (Twitter, standard meta)
  <meta name="description" content="Developer and designer">
  <meta name="twitter:card" content="summary">
- Itemprop-based (Schema.org)
  <meta itemprop="name" content="John Doe">
  <meta itemprop="description" content="...">
Duplicate Handling
From extraction.js:34-45, the extractor combines duplicate metadata:
['property', 'name', 'itemprop'].forEach((item, i) => {
  if (temp_dict[item]) {
    temp_metadata_list.forEach((_item, i) => {
      if (_item[item]) {
        if (_item[item] === temp_dict[item]) {
          // Combine duplicate entries
          temp_metadata_list[i].content += ', ' + temp_dict.content
          add = false
        }
      }
    })
  }
})
This prevents redundant metadata entries while preserving multiple values.
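The merge behavior can be tried in isolation with a small standalone sketch. The helper name `addOrMerge` is hypothetical, and the sketch assumes that the `add` flag gates the final push onto the list, which the excerpt implies but does not show.

```javascript
// Hypothetical standalone helper; in the project this logic is inline in extract_metadata.
// Assumption: an entry is pushed only when no existing entry shares its key.
function addOrMerge (list, entry) {
  let add = true
  for (const key of ['property', 'name', 'itemprop']) {
    if (!entry[key]) continue
    for (const existing of list) {
      if (existing[key] && existing[key] === entry[key]) {
        // Same key seen before: append the new content instead of adding a row.
        existing.content += ', ' + entry.content
        add = false
      }
    }
  }
  if (add) list.push({ ...entry })
  return list
}

const list = []
addOrMerge(list, { property: 'og:image', content: 'https://example.com/a.png' })
addOrMerge(list, { property: 'og:image', content: 'https://example.com/b.png' })
addOrMerge(list, { name: 'description', content: 'Developer' })

console.log(list) // two entries; the og:image entry holds both URLs
```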
Usage
# Enable metadata extraction
node app.js --username "johndoe" --metadata
# Metadata with specific websites
node app.js --username "johndoe" --metadata --websites "twitter facebook"
# Metadata with top-ranked sites
node app.js --username "johndoe" --metadata --top 100
Example Output
{
  "username": "johndoe",
  "link": "https://twitter.com/johndoe",
  "status": "good",
  "metadata": [
    {
      "property": "og:title",
      "content": "John Doe (@johndoe)"
    },
    {
      "property": "og:description",
      "content": "Software Developer | Open Source Enthusiast"
    },
    {
      "property": "og:image",
      "content": "https://pbs.twimg.com/profile_images/..."
    },
    {
      "name": "twitter:card",
      "content": "summary"
    },
    {
      "name": "description",
      "content": "John Doe's profile on Twitter"
    }
  ]
}
Metadata extraction only occurs for profiles with “good” status to reduce processing time and focus on confirmed matches.
Pattern Extraction
Pattern extraction uses regular expressions to find specific information patterns within profile HTML source code.
How It Works
From extraction.js:60-87, the pattern extractor:
async function extract_patterns (site, source) {
  const temp_patterns_list = []
  const temp_patterns_for_checking = []
  if ('extract' in site) {
    site.extract.forEach((item, i) => {
      const regex_pattern = new RegExp(item.regex, 'g')
      let found = null
      while (found = regex_pattern.exec(source)) {
        if (!temp_patterns_for_checking.includes(found[1])) {
          temp_patterns_for_checking.push(found[1])
          if (item.type === 'link') {
            found[1] = decodeURIComponent(found[1])
          }
          temp_patterns_list.push({
            type: item.type,
            matched: found[1]
          })
        }
      }
    })
  }
  return temp_patterns_list
}
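As a rough illustration of how the loop behaves, the sketch below is a condensed, synchronous copy of the same logic run against a hand-written HTML string. The function name `extractPatterns`, the `site` object, and the sample markup are all assumptions made for the demo. Note that duplicates are detected on the raw match, before URL decoding.

```javascript
// Condensed, synchronous copy of the loop above, for demonstration only.
function extractPatterns (site, source) {
  const results = []
  const seen = []
  if ('extract' in site) {
    site.extract.forEach((item) => {
      const regex = new RegExp(item.regex, 'g')
      let found = null
      while ((found = regex.exec(source)) !== null) {
        if (!seen.includes(found[1])) {
          seen.push(found[1]) // de-duplicate on the raw captured text
          const matched = item.type === 'link'
            ? decodeURIComponent(found[1])
            : found[1]
          results.push({ type: item.type, matched })
        }
      }
    })
  }
  return results
}

// Invented site entry and page source.
const site = { extract: [{ type: 'link', regex: 'href="(https?://[^"]+)"' }] }
const source =
  '<a href="https://example.com/a%20b">one</a>' +
  '<a href="https://example.com/a%20b">duplicate</a>' +
  '<a href="https://example.org/">two</a>'

const results = extractPatterns(site, source)
console.log(results)
```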
Pattern Types
Each website in the detection database can define custom extraction patterns:
- Email addresses
- Phone numbers
- Social media links
- Website URLs
- User IDs
- Custom patterns
Patterns are configured per website in sites.json:
{
  "url": "https://example.com/{username}",
  "extract": [
    {
      "type": "email",
      "regex": "([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,})"
    },
    {
      "type": "phone",
      "regex": "(\\+?[0-9]{1,3}[-.\\s]?[0-9]{3}[-.\\s]?[0-9]{4})"
    },
    {
      "type": "link",
      "regex": "href=\"(https?://[^\"]+)\""
    }
  ]
}
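Because the extractor reads the first capture group (`found[1]`), a pattern with no capturing group would yield `undefined` for every match. A small check like the one below can flag such entries before a scan. It is a sketch only: `countCaptureGroups` and `findEntriesWithoutGroup` are hypothetical helpers, and the sample entries are invented.

```javascript
// Counts capturing groups by appending an empty alternative, which always
// matches the empty string; the result array then has one slot per group.
function countCaptureGroups (pattern) {
  return new RegExp(pattern + '|').exec('').length - 1
}

// Returns the types of all entries whose regex has no capturing group.
function findEntriesWithoutGroup (extractEntries) {
  return extractEntries
    .filter((entry) => countCaptureGroups(entry.regex) === 0)
    .map((entry) => entry.type)
}

// Invented sample: the phone pattern deliberately lacks a capturing group.
const extractEntries = [
  { type: 'email', regex: '([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,})' },
  { type: 'phone', regex: '\\+?[0-9]{1,3}[-.\\s]?[0-9]{3}[-.\\s]?[0-9]{4}' }
]

const missing = findEntriesWithoutGroup(extractEntries)
console.log(missing) // [ 'phone' ]
```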
URL Decoding
Links are automatically URL-decoded to make them human-readable:
if (item.type === 'link') {
  found[1] = decodeURIComponent(found[1])
}
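One caveat worth knowing: `decodeURIComponent` throws a `URIError` when the input contains a malformed escape sequence, such as a stray `%`. The excerpt does not show how the project handles that case, so the wrapper below is only a suggestion. `safeDecode` is a hypothetical name, and it falls back to the raw value when decoding fails.

```javascript
// Hypothetical wrapper: returns the decoded value, or the raw value when the
// input is not valid percent-encoding.
function safeDecode (value) {
  try {
    return decodeURIComponent(value)
  } catch (err) {
    return value
  }
}

const decoded = safeDecode('https://example.com/caf%C3%A9')
const untouched = safeDecode('https://example.com/100%')

console.log(decoded)   // https://example.com/café
console.log(untouched) // https://example.com/100%
```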
Usage
# Enable pattern extraction
node app.js --username "johndoe" --extract
# Extract both metadata and patterns
node app.js --username "johndoe" --metadata --extract
# With specific sites
node app.js --username "johndoe" --extract --websites "linkedin github"
Example Output
{
  "username": "johndoe",
  "link": "https://github.com/johndoe",
  "status": "good",
  "extracted": [
    {
      "type": "email",
      "matched": "johndoe@example.com"
    },
    {
      "type": "link",
      "matched": "https://johndoe.com"
    },
    {
      "type": "link",
      "matched": "https://twitter.com/johndoe"
    }
  ]
}
Pattern extraction is particularly useful for finding cross-platform connections and building a comprehensive profile of the target.
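When post-processing results, it is often handy to group the `extracted` entries by type. The sketch below assumes a profile object shaped like the example output above; `groupByType` is a hypothetical helper and the data is invented.

```javascript
// Invented profile, shaped like the example output.
const profile = {
  username: 'johndoe',
  link: 'https://github.com/johndoe',
  status: 'good',
  extracted: [
    { type: 'email', matched: 'johndoe@example.com' },
    { type: 'link', matched: 'https://johndoe.example' },
    { type: 'link', matched: 'https://twitter.com/johndoe' }
  ]
}

// Hypothetical helper: collects matched values under their pattern type.
function groupByType (extracted = []) {
  const groups = {}
  for (const { type, matched } of extracted) {
    if (!groups[type]) groups[type] = []
    groups[type].push(matched)
  }
  return groups
}

const grouped = groupByType(profile.extracted)
console.log(grouped)
```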
Integration with Detection Modes
Extraction features work with both fast and slow detection modes.
Fast Mode Integration
From fast-scan.js:162-177:
if (temp_profile.status === 'good') {
  if (options.includes('ExtractPatterns')) {
    let temp_extracted_list = []
    temp_extracted_list = await extraction.extract_patterns(site, source)
    if (temp_extracted_list.length > 0) {
      temp_profile.extracted = temp_extracted_list
    }
  }
  if (options.includes('ExtractMetadata')) {
    let temp_metadata_list = []
    temp_metadata_list = await extraction.extract_metadata(site, source)
    if (temp_metadata_list.length > 0) {
      temp_profile.metadata = temp_metadata_list
    }
  }
}
Slow Mode Integration
From slow-scan.js:136-151, slow mode uses the same extraction logic:
if (temp_profile.status === 'good') {
  if (options.includes('ExtractPatterns')) {
    let temp_extracted_list = []
    temp_extracted_list = await extraction.extract_patterns(site, source)
    if (temp_extracted_list.length > 0) {
      temp_profile.extracted = temp_extracted_list
    }
  }
  if (options.includes('ExtractMetadata')) {
    let temp_metadata_list = []
    temp_metadata_list = await extraction.extract_metadata(site, source)
    if (temp_metadata_list.length > 0) {
      temp_profile.metadata = temp_metadata_list
    }
  }
}
Extraction only occurs after a profile is confirmed with “good” status to optimize performance.
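Since both modes run identical steps, the flow can be summarized as one function. The sketch below is not how the project is organized; `applyExtraction` is a hypothetical helper, and the `stubExtraction` object stands in for the real extraction module so the example runs on its own.

```javascript
// Hypothetical helper mirroring the excerpts: extraction runs only for
// profiles with "good" status, and empty results are not attached.
async function applyExtraction (extraction, options, site, source, profile) {
  if (profile.status !== 'good') return profile
  if (options.includes('ExtractPatterns')) {
    const extracted = await extraction.extract_patterns(site, source)
    if (extracted.length > 0) profile.extracted = extracted
  }
  if (options.includes('ExtractMetadata')) {
    const metadata = await extraction.extract_metadata(site, source)
    if (metadata.length > 0) profile.metadata = metadata
  }
  return profile
}

// Stub standing in for the real module (invented return values).
const stubExtraction = {
  extract_patterns: async () => [{ type: 'link', matched: 'https://example.com' }],
  extract_metadata: async () => []
}

const demo = applyExtraction(
  stubExtraction,
  ['ExtractPatterns', 'ExtractMetadata'],
  {},
  '<html></html>',
  { status: 'good' }
)
demo.then((profile) => console.log(profile))
```

In this run the profile gains an `extracted` field but no `metadata` field, because the stub returns an empty metadata list.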
Open Graph (Facebook)
og:title - Profile or page title
og:description - Profile bio or description
og:image - Profile picture URL
og:url - Canonical profile URL
og:type - Content type (profile, article, etc.)
og:site_name - Platform name
Twitter Cards
twitter:card - Card type (summary, player, etc.)
twitter:site - Site’s Twitter handle
twitter:creator - Content creator’s handle
twitter:title - Content title
twitter:description - Content description
twitter:image - Image URL
Schema.org
name - Person or organization name
description - Profile description
image - Profile image
url - Website URL
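These fields overlap across vocabularies, so a consumer usually wants one value per concept. The following sketch picks the first available value in a fixed priority order (Open Graph, then Twitter, then Schema.org). The priority order and the `pickField` helper are assumptions for illustration, and the metadata list is invented.

```javascript
// Hypothetical helper: returns the content of the first entry whose
// property, name, or itemprop equals one of the candidates, in order.
function pickField (metadata, candidates) {
  for (const wanted of candidates) {
    const hit = metadata.find((entry) =>
      [entry.property, entry.name, entry.itemprop].includes(wanted) &&
      Boolean(entry.content)
    )
    if (hit) return hit.content
  }
  return null
}

// Invented metadata list in the shape produced by metadata extraction.
const metadata = [
  { name: 'twitter:title', content: 'John Doe on Example' },
  { property: 'og:title', content: 'John Doe (@johndoe)' },
  { itemprop: 'image', content: 'https://example.com/avatar.png' }
]

const summary = {
  title: pickField(metadata, ['og:title', 'twitter:title', 'name']),
  description: pickField(metadata, ['og:description', 'twitter:description', 'description']),
  image: pickField(metadata, ['og:image', 'twitter:image', 'image'])
}

console.log(summary)
```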
Memory Usage
Metadata extraction is memory-efficient as it:
- Filters out unnecessary meta tags
- Combines duplicates
- Only processes confirmed profiles
Processing Time
Extraction adds minimal overhead:
- Metadata: ~10-50ms per profile
- Patterns: Depends on regex complexity and source size
- Total: Usually less than 100ms additional per profile
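Actual timings depend on page size and on the patterns involved, so it can be worth measuring on your own data. The sketch below times a single regex pass over a synthetic page using Node's `perf_hooks`. The `timeIt` helper and the generated source are assumptions, and the numbers it prints say nothing about the tool's real overhead.

```javascript
const { performance } = require('perf_hooks')

// Hypothetical helper: runs fn once and reports wall-clock time in milliseconds.
function timeIt (fn) {
  const start = performance.now()
  const result = fn()
  return { result, elapsedMs: performance.now() - start }
}

// Synthetic page source with 2000 distinct links (invented data).
const source = Array.from({ length: 2000 }, (_, n) =>
  `<a href="https://example.com/user/${n}">profile ${n}</a>`
).join('\n')

const timing = timeIt(() => {
  const regex = /href="(https?:\/\/[^"]+)"/g
  let count = 0
  while (regex.exec(source) !== null) count++
  return count
})

console.log(`matched ${timing.result} links in ${timing.elapsedMs.toFixed(2)} ms`)
```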
Optimization Tips
- Use with filtering: Combine with --filter good to extract only from confirmed profiles
- Limit websites: Use --websites or --top to reduce the number of profiles processed
- Choose wisely: Only enable extraction when you need the additional intelligence
# Optimized extraction command
node app.js --username "johndoe" --metadata --extract \
--filter good --top 50
Practical Applications
OSINT Investigations
- Build comprehensive profiles across platforms
- Find hidden connections between accounts
- Identify real names and contact information
- Map social networks and relationships
Data Correlation
# Extract metadata from multiple related profiles
node app.js --username "johndoe,jdoe,john.doe" \
--metadata --extract --filter good
This helps identify:
- Shared email addresses
- Common profile pictures
- Consistent bio information
- Cross-platform links
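A correlation pass over the results could look like the sketch below, which reports extracted values that appear on more than one profile. It assumes each profile carries `link` and, optionally, `extracted` fields as in the example output. `findSharedValues` is a hypothetical helper, the profiles are invented, and only email addresses are compared case-insensitively.

```javascript
// Hypothetical helper: maps each extracted value to the profile links it was
// found on, and keeps only values seen on more than one profile.
function findSharedValues (profiles) {
  const seenOn = new Map()
  for (const profile of profiles) {
    for (const { type, matched } of profile.extracted || []) {
      const value = type === 'email' ? matched.toLowerCase() : matched
      const key = `${type}:${value}`
      if (!seenOn.has(key)) seenOn.set(key, new Set())
      seenOn.get(key).add(profile.link)
    }
  }
  return [...seenOn]
    .filter(([, links]) => links.size > 1)
    .map(([value, links]) => ({ value, links: [...links] }))
}

// Invented results for three usernames.
const profiles = [
  {
    link: 'https://github.com/johndoe',
    extracted: [
      { type: 'email', matched: 'JohnDoe@example.com' },
      { type: 'link', matched: 'https://johndoe.example' }
    ]
  },
  {
    link: 'https://gitlab.com/jdoe',
    extracted: [{ type: 'email', matched: 'johndoe@example.com' }]
  },
  { link: 'https://example.org/john.doe' } // nothing extracted
]

const shared = findSharedValues(profiles)
console.log(shared)
```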
Security Research
- Identify information leakage
- Find exposed personal data
- Map digital footprints
- Assess privacy exposure
Extracted data is available in multiple formats:
node app.js --username "johndoe" --metadata --extract --output json > output.json
node app.js --username "johndoe" --metadata --extract --output pretty
Log Files
All extraction results are automatically logged:
node app.js --username "johndoe" --metadata --extract --logs
# Results saved to: logs/[uuid]_log.txt
Be mindful of privacy and legal considerations when extracting and storing personal information. Always ensure you have proper authorization for OSINT activities.