Metadata Extraction

Social Analyzer can extract structured metadata and patterns from detected profiles to gather intelligence about the profile owner. This information is crucial for OSINT investigations and profile correlation.

Overview

The extraction module (extraction.js) provides two main capabilities:

Metadata Extraction: Extracts meta tags from profile HTML
Pattern Extraction: Finds specific patterns like emails, phones, and links

Both features are optional and are enabled with the --metadata and --extract command-line flags.

Metadata Extraction

Metadata extraction parses HTML meta tags to collect information about profiles, including social graph data, descriptions, images, and other structured information.

How It Works

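The metadata extractor (extraction.js:6-58) loads the profile source with cheerio, skips meta tags that match the filter, and copies the property, name, itemprop, and content attributes of each remaining tag. An abridged excerpt: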

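const cheerio = require('cheerio')

// `site` is the site definition, `source` is the raw profile HTML
async function extract_metadata (site, source) {
  const $ = cheerio.load(source)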
  const meta = $('meta')
  const temp_metadata_list = []
  const temp_metadata_for_checking = []
  
  Object.keys(meta).forEach(function (key) {
    if (meta[key].attribs) {
      if (!strings_meta.test(JSON.stringify(meta[key].attribs))) {
        const temp_dict = {}
        
        if (meta[key].attribs.property) {
          temp_dict.property = meta[key].attribs.property
        }
        if (meta[key].attribs.name) {
          temp_dict.name = meta[key].attribs.name
        }
        if (meta[key].attribs.itemprop) {
          temp_dict.itemprop = meta[key].attribs.itemprop
        }
        if (meta[key].attribs.content) {
          temp_dict.content = meta[key].attribs.content
        }
        
        temp_metadata_list.push(temp_dict)
      }
    }
  })
  
  return temp_metadata_list
}

Filtered Meta Tags

The extractor filters out technical meta tags by testing each tag's JSON-serialized attributes against a case-insensitive regular expression:

const strings_meta = new RegExp(
  'regionsAllowed|width|height|color|rgba\\(|charset|viewport|refresh|equiv', 
  'i'
)

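This drops layout and browser directives such as viewport and charset, so only descriptive metadata is kept. Because the test runs against the whole serialized attribute object, a tag is also dropped when one of these words appears in its content value.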
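To see what the filter keeps, the sketch below applies the same pattern to a few hypothetical attribute objects. The `samples` array is invented for illustration and only mimics the shape of the `attribs` objects used in the extractor. The third sample is dropped only because its content happens to contain the word "color".

```javascript
// Same pattern as the one shown above
const strings_meta = new RegExp(
  'regionsAllowed|width|height|color|rgba\\(|charset|viewport|refresh|equiv',
  'i'
)

// Hypothetical attribute objects (not taken from a real profile)
const samples = [
  { name: 'viewport', content: 'width=device-width, initial-scale=1' },
  { property: 'og:title', content: 'John Doe' },
  { name: 'description', content: 'I love color theory' },
  { 'http-equiv': 'refresh', content: '30' }
]

// Keep only the attribute sets that do NOT match the filter
const kept = samples.filter(
  (attribs) => !strings_meta.test(JSON.stringify(attribs))
)

console.log(kept)
// [ { property: 'og:title', content: 'John Doe' } ]
```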

Supported Meta Tag Types

The extractor handles three types of meta tags:

Property-based (Open Graph, Facebook)

<meta property="og:title" content="John Doe">
<meta property="og:image" content="https://...">

Name-based (Twitter, standard meta)

<meta name="description" content="Developer and designer">
<meta name="twitter:card" content="summary">

Itemprop-based (Schema.org)

<meta itemprop="name" content="John Doe">
<meta itemprop="description" content="...">

Duplicate Handling

From extraction.js:34-45, the extractor combines duplicate metadata:

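// `add` is a flag defined outside this excerpt (assumed to control
// whether temp_dict is pushed as a new entry afterwards).
['property', 'name', 'itemprop'].forEach((item) => {
  if (temp_dict[item]) {
    temp_metadata_list.forEach((_item, i) => {
      if (_item[item] && _item[item] === temp_dict[item]) {
        // Same key already collected: append this value to the existing entry
        temp_metadata_list[i].content += ', ' + temp_dict.content
        add = false
      }
    })
  }
})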

This prevents redundant metadata entries while preserving multiple values.
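The merging behaviour can be tried in isolation with a simplified sketch. `addOrMerge` and `KEY_ATTRS` are hypothetical names, not part of extraction.js, and unlike the excerpt this version stops at the first matching entry.

```javascript
const KEY_ATTRS = ['property', 'name', 'itemprop']

// Append `entry` to `list`, or fold its content into an existing entry
// that shares the same property, name, or itemprop value.
function addOrMerge (list, entry) {
  const existing = list.find((other) =>
    KEY_ATTRS.some((attr) => entry[attr] && other[attr] === entry[attr])
  )
  if (!existing) {
    list.push({ ...entry })
  } else if (entry.content !== undefined) {
    existing.content = existing.content === undefined
      ? entry.content
      : existing.content + ', ' + entry.content
  }
  return list
}

const merged = []
addOrMerge(merged, { property: 'og:image', content: 'https://example.com/a.png' })
addOrMerge(merged, { property: 'og:title', content: 'John Doe' })
addOrMerge(merged, { property: 'og:image', content: 'https://example.com/b.png' })

console.log(merged.length) // 2
console.log(merged[0].content)
// https://example.com/a.png, https://example.com/b.png
```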

Usage

# Enable metadata extraction
node app.js --username "johndoe" --metadata

# Metadata with specific websites
node app.js --username "johndoe" --metadata --websites "twitter facebook"

# Metadata with top-ranked sites
node app.js --username "johndoe" --metadata --top 100

Example Output

{
  "username": "johndoe",
  "link": "https://twitter.com/johndoe",
  "status": "good",
  "metadata": [
    {
      "property": "og:title",
      "content": "John Doe (@johndoe)"
    },
    {
      "property": "og:description",
      "content": "Software Developer | Open Source Enthusiast"
    },
    {
      "property": "og:image",
      "content": "https://pbs.twimg.com/profile_images/..."
    },
    {
      "name": "twitter:card",
      "content": "summary"
    },
    {
      "name": "description",
      "content": "John Doe's profile on Twitter"
    }
  ]
}

Metadata extraction only occurs for profiles with “good” status to reduce processing time and focus on confirmed matches.
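When post-processing results, it can be convenient to turn the metadata array into a lookup keyed by tag name. The sketch below assumes a profile object shaped like the example output above; `indexMetadata` is a hypothetical helper, not part of Social Analyzer.

```javascript
// Trimmed copy of the example output shown above
const profile = {
  username: 'johndoe',
  link: 'https://twitter.com/johndoe',
  status: 'good',
  metadata: [
    { property: 'og:title', content: 'John Doe (@johndoe)' },
    { name: 'twitter:card', content: 'summary' },
    { name: 'description', content: "John Doe's profile on Twitter" }
  ]
}

// Build { key: content } using whichever key attribute the entry carries
function indexMetadata (metadata) {
  const index = {}
  for (const entry of metadata || []) {
    const key = entry.property || entry.name || entry.itemprop
    if (key && entry.content !== undefined) {
      index[key] = entry.content
    }
  }
  return index
}

const fields = indexMetadata(profile.metadata)
console.log(fields['og:title']) // John Doe (@johndoe)
console.log(fields['twitter:card']) // summary
```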

Pattern Extraction

Pattern extraction uses regular expressions to find specific information patterns within profile HTML source code.

How It Works

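The pattern extractor (extraction.js:60-87) runs every regex defined in the site's extract list against the profile source and records the first capture group of each unique match: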

async function extract_patterns (site, source) {
  const temp_patterns_list = []
  const temp_patterns_for_checking = []
  
  if ('extract' in site) {
    site.extract.forEach((item, i) => {
      const regex_pattern = new RegExp(item.regex, 'g')
      let found = null
      
      while (found = regex_pattern.exec(source)) {
        if (!temp_patterns_for_checking.includes(found[1])) {
          temp_patterns_for_checking.push(found[1])
          
          if (item.type === 'link') {
            found[1] = decodeURIComponent(found[1])
          }
          
          temp_patterns_list.push({
            type: item.type,
            matched: found[1]
          })
        }
      }
    })
  }
  
  return temp_patterns_list
}

Pattern Types

Each website in the detection database can define custom extraction patterns, each with its own type label, for data such as:

Email addresses
Phone numbers
Social media links
Website URLs
User IDs
Custom patterns

Configuration Format

Patterns are configured per website in sites.json:

{
  "url": "https://example.com/{username}",
  "extract": [
    {
      "type": "email",
      "regex": "([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,})"
    },
    {
      "type": "phone",
      "regex": "(\\+?[0-9]{1,3}[-.\\s]?[0-9]{3}[-.\\s]?[0-9]{4})"
    },
    {
      "type": "link",
      "regex": "href=\"(https?://[^\"]+)\""
    }
  ]
}
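Because the extractor reads found[1], each regex needs a capture group around the value to report. The following standalone sketch applies a configuration of this shape to a made-up HTML snippet. `extractPatterns`, the `site` object, and the `source` string are illustrative assumptions, and the loop is a simplified stand-in for the real extractor.

```javascript
// Hypothetical site definition with two extraction patterns
const site = {
  extract: [
    { type: 'email', regex: '([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,})' },
    { type: 'link', regex: 'href="(https?://[^"]+)"' }
  ]
}

// Made-up profile source; the same link appears twice
const source =
  '<a href="https://example.com/a%20b">site</a>' +
  '<p>Contact: jane@example.org</p>' +
  '<a href="https://example.com/a%20b">again</a>'

function extractPatterns (site, source) {
  const results = []
  const seen = new Set()
  for (const item of site.extract || []) {
    for (const found of source.matchAll(new RegExp(item.regex, 'g'))) {
      const raw = found[1]
      // Skip patterns without a capture group and repeated matches
      if (raw === undefined || seen.has(raw)) continue
      seen.add(raw)
      results.push({
        type: item.type,
        matched: item.type === 'link' ? decodeURIComponent(raw) : raw
      })
    }
  }
  return results
}

const results = extractPatterns(site, source)
console.log(results)
// [ { type: 'email', matched: 'jane@example.org' },
//   { type: 'link', matched: 'https://example.com/a b' } ]
```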

URL Decoding

Matches of type link are percent-decoded with decodeURIComponent to make them human-readable:

if (item.type === 'link') {
  found[1] = decodeURIComponent(found[1])
}
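One caveat worth knowing: decodeURIComponent throws a URIError on malformed percent-encoding, such as a stray "%". A defensive wrapper could look like the sketch below; `safeDecode` is a hypothetical helper and is not claimed to exist in extraction.js.

```javascript
// Decode a percent-encoded string, falling back to the raw value
// when the input is not valid percent-encoding.
function safeDecode (value) {
  try {
    return decodeURIComponent(value)
  } catch (err) {
    // Malformed sequence: keep the original match
    return value
  }
}

console.log(safeDecode('https://example.com/a%20b')) // https://example.com/a b
console.log(safeDecode('https://example.com/100%')) // https://example.com/100%
```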

Usage

# Enable pattern extraction
node app.js --username "johndoe" --extract

# Extract both metadata and patterns
node app.js --username "johndoe" --metadata --extract

# With specific sites
node app.js --username "johndoe" --extract --websites "linkedin github"

Example Output

{
  "username": "johndoe",
  "link": "https://github.com/johndoe",
  "status": "good",
  "extracted": [
    {
      "type": "email",
      "matched": "johndoe@example.com"
    },
    {
      "type": "link",
      "matched": "https://johndoe.com"
    },
    {
      "type": "link",
      "matched": "https://twitter.com/johndoe"
    }
  ]
}

Pattern extraction is particularly useful for finding cross-platform connections and building a comprehensive profile of the target.

Integration with Detection Modes

Extraction features work with both fast and slow detection modes.

Fast Mode Integration

From fast-scan.js:162-177:

if (temp_profile.status === 'good') {
  if (options.includes('ExtractPatterns')) {
    let temp_extracted_list = []
    temp_extracted_list = await extraction.extract_patterns(site, source)
    if (temp_extracted_list.length > 0) {
      temp_profile.extracted = temp_extracted_list
    }
  }
  
  if (options.includes('ExtractMetadata')) {
    let temp_metadata_list = []
    temp_metadata_list = await extraction.extract_metadata(site, source)
    if (temp_metadata_list.length > 0) {
      temp_profile.metadata = temp_metadata_list
    }
  }
}

Slow Mode Integration

From slow-scan.js:136-151, slow mode uses the same extraction logic:

if (temp_profile.status === 'good') {
  if (options.includes('ExtractPatterns')) {
    let temp_extracted_list = []
    temp_extracted_list = await extraction.extract_patterns(site, source)
    if (temp_extracted_list.length > 0) {
      temp_profile.extracted = temp_extracted_list
    }
  }
  
  if (options.includes('ExtractMetadata')) {
    let temp_metadata_list = []
    temp_metadata_list = await extraction.extract_metadata(site, source)
    if (temp_metadata_list.length > 0) {
      temp_profile.metadata = temp_metadata_list
    }
  }
}

Extraction only occurs after a profile is confirmed with “good” status to optimize performance.

Common Metadata Fields

Open Graph (Facebook)

og:title - Profile or page title
og:description - Profile bio or description
og:image - Profile picture URL
og:url - Canonical profile URL
og:type - Content type (profile, article, etc.)
og:site_name - Platform name

Twitter Cards

twitter:card - Card type (summary, player, etc.)
twitter:site - Site’s Twitter handle
twitter:creator - Content creator’s handle
twitter:title - Content title
twitter:description - Content description
twitter:image - Image URL

Schema.org

name - Person or organization name
description - Profile description
image - Profile image
url - Website URL
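Since the same piece of information can appear under several of these fields, a consumer may want to pick the first one that is present. A minimal sketch, assuming metadata entries shaped like the example output earlier in this page (`firstContent` and the sample data are hypothetical):

```javascript
// Return the content of the first entry whose property, name, or
// itemprop equals one of `keys`, trying the keys in order.
function firstContent (metadata, keys) {
  for (const key of keys) {
    const hit = metadata.find(
      (e) => [e.property, e.name, e.itemprop].includes(key) && e.content
    )
    if (hit) return hit.content
  }
  return null
}

// Invented sample: no og:title, but a twitter:title and a Schema.org name
const sampleMetadata = [
  { name: 'twitter:title', content: 'John Doe' },
  { itemprop: 'name', content: 'J. Doe' },
  { property: 'og:image', content: 'https://example.com/avatar.png' }
]

const title = firstContent(sampleMetadata, ['og:title', 'twitter:title', 'name'])
const image = firstContent(sampleMetadata, ['og:image', 'twitter:image', 'image'])
const url = firstContent(sampleMetadata, ['og:url', 'url'])

console.log(title) // John Doe
console.log(image) // https://example.com/avatar.png
console.log(url) // null
```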

Performance Considerations

Memory Usage

Metadata extraction is memory-efficient as it:

Filters out unnecessary meta tags
Combines duplicates
Only processes confirmed profiles

Processing Time

Extraction adds minimal overhead:

Metadata: ~10-50ms per profile
Patterns: Depends on regex complexity and source size
Total: Usually less than 100ms additional per profile

Optimization Tips

Use with filtering: Combine with --filter good so the output contains only confirmed profiles, which are the only ones extraction runs on
Limit websites: Use --websites or --top to reduce the number of profiles processed
Choose wisely: Only enable extraction when you need the additional intelligence

# Optimized extraction command
node app.js --username "johndoe" --metadata --extract \
  --filter good --top 50

Practical Applications

OSINT Investigations

Build comprehensive profiles across platforms
Find hidden connections between accounts
Identify real names and contact information
Map social networks and relationships

Data Correlation

# Extract metadata from multiple related profiles
node app.js --username "johndoe,jdoe,john.doe" \
  --metadata --extract --filter good

This helps identify:

Shared email addresses
Common profile pictures
Consistent bio information
Cross-platform links
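Once results are collected, shared values can be found with a small script. The sketch below assumes an array of profile objects shaped like the pattern extraction example output; the `profiles` data and the `sharedValues` helper are invented for illustration.

```javascript
// Hypothetical results for three related usernames
const profiles = [
  {
    link: 'https://github.com/johndoe',
    extracted: [
      { type: 'email', matched: 'jd@example.com' },
      { type: 'link', matched: 'https://johndoe.example' }
    ]
  },
  {
    link: 'https://example.org/jdoe',
    extracted: [{ type: 'email', matched: 'jd@example.com' }]
  },
  { link: 'https://example.net/john.doe' } // nothing extracted
]

// Group profile links by extracted value and keep values seen on 2+ profiles
function sharedValues (profiles) {
  const byValue = new Map()
  for (const profile of profiles) {
    for (const { type, matched } of profile.extracted || []) {
      const key = `${type}:${matched}`
      if (!byValue.has(key)) byValue.set(key, new Set())
      byValue.get(key).add(profile.link)
    }
  }
  return [...byValue]
    .filter(([, links]) => links.size > 1)
    .map(([key, links]) => ({ key, links: [...links] }))
}

const shared = sharedValues(profiles)
console.log(shared)
// [ { key: 'email:jd@example.com',
//     links: [ 'https://github.com/johndoe', 'https://example.org/jdoe' ] } ]
```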

Security Research

Identify information leakage
Find exposed personal data
Map digital footprints
Assess privacy exposure

Output Formats

Extracted data is available in multiple formats:

JSON Format

node app.js --username "johndoe" --metadata --extract --output json > output.json

Pretty Format

node app.js --username "johndoe" --metadata --extract --output pretty

Log Files

With the --logs flag, extraction results are also written to a log file:

node app.js --username "johndoe" --metadata --extract --logs
# Results saved to: logs/[uuid]_log.txt

Be mindful of privacy and legal considerations when extracting and storing personal information. Always ensure you have proper authorization for OSINT activities.
