Overview

Crawlith analyzes on-page SEO elements to identify optimization opportunities and potential issues. The analysis covers title, meta description, and H1 validation; structured data detection; and content quality assessment.

Title Analysis

Crawlith validates page titles against SEO best practices:
// From seo.ts:21-33
export function analyzeTitle($: CheerioAPI | string): TextFieldAnalysis {
  // Accept either raw HTML or a parsed Cheerio instance
  const cheerioObj = typeof $ === 'string' ? load($) : $;
  const title = cheerioObj('title').first().text().trim();
  if (!title) {
    return { value: null, length: 0, status: 'missing' };
  }

  if (title.length < 50) return { value: title, length: title.length, status: 'too_short' };
  if (title.length > 60) return { value: title, length: title.length, status: 'too_long' };
  return { value: title, length: title.length, status: 'ok' };
}

Title Status Values

| Status | Condition | Recommendation |
| --- | --- | --- |
| missing | No <title> tag | Add a title tag to every page |
| too_short | < 50 characters | Expand to 50-60 characters for better SERP display |
| too_long | > 60 characters | Shorten to avoid truncation in search results |
| duplicate | Same as another page | Create unique titles for each page |
| ok | 50-60 characters, unique | Optimal length |
Title tags are one of the most important on-page SEO elements. Google typically displays the first 50-60 characters in search results.
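The length rules above can be reproduced in a few lines. The sketch below is an illustrative stand-in (`classifyTitle` is a hypothetical helper, not part of Crawlith's API) that applies the same thresholds:

```typescript
type TitleStatus = 'missing' | 'too_short' | 'too_long' | 'ok';

// Hypothetical standalone version of the title-length rules above.
function classifyTitle(title: string | null): TitleStatus {
  const value = (title ?? '').trim();
  if (!value) return 'missing';               // no <title> text at all
  if (value.length < 50) return 'too_short';  // expand toward 50-60 characters
  if (value.length > 60) return 'too_long';   // risks truncation in SERPs
  return 'ok';
}

console.log(classifyTitle('Home'));           // 'too_short'
console.log(classifyTitle('x'.repeat(55)));   // 'ok'
```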

Meta Description Analysis

// From seo.ts:35-52
export function analyzeMetaDescription($: CheerioAPI | string): TextFieldAnalysis {
  // Accept either raw HTML or a parsed Cheerio instance
  const cheerioObj = typeof $ === 'string' ? load($) : $;
  const raw = cheerioObj('meta[name="description"]').attr('content');
  if (raw === undefined) {
    return { value: null, length: 0, status: 'missing' };
  }

  const description = raw.trim();
  if (!description) {
    return { value: '', length: 0, status: 'missing' };
  }

  if (description.length < 140) return { value: description, length: description.length, status: 'too_short' };
  if (description.length > 160) return { value: description, length: description.length, status: 'too_long' };
  return { value: description, length: description.length, status: 'ok' };
}

Meta Description Guidelines

  • Optimal length: 140-160 characters
  • Too short: Less than 140 characters (underutilizes SERP space)
  • Too long: More than 160 characters (gets truncated)
  • Missing: No meta description tag (Google generates one from content)
Missing meta descriptions allow search engines to generate their own snippets, which may not accurately represent your page or include your target keywords.

H1 Analysis

H1 tags provide page hierarchy and topical signals:
// From seo.ts:54-70
export function analyzeH1($: CheerioAPI | string, titleValue: string | null): H1Analysis {
  // Accept either raw HTML or a parsed Cheerio instance
  const cheerioObj = typeof $ === 'string' ? load($) : $;
  const h1Values = cheerioObj('h1').toArray().map((el) => cheerioObj(el).text().trim()).filter(Boolean);
  const count = h1Values.length;
  const first = h1Values[0] || null;
  const matchesTitle = Boolean(first && titleValue && normalizedText(first) === normalizedText(titleValue));

  if (count === 0) {
    return { count, status: 'critical', matchesTitle };
  }
  if (count > 1) {
    return { count, status: 'warning', matchesTitle };
  }
  return { count, status: 'ok', matchesTitle };
}

H1 Status Levels

| Status | Condition | Issue |
| --- | --- | --- |
| critical | No H1 tags | Missing primary heading signal |
| warning | Multiple H1 tags | Diluted topical focus, confuses search engines |
| ok | Exactly one H1 | Follows best practices |

H1-Title Matching

The analyzer also checks if the H1 matches the title tag:
const matchesTitle = Boolean(
  first && titleValue && 
  normalizedText(first) === normalizedText(titleValue)
);
Why it matters: When H1 and title match, it reinforces topical consistency and keyword targeting.
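The `normalizedText` helper is not shown in the excerpts. A plausible implementation (an assumption, shown only for illustration) lowercases, collapses whitespace, and trims, so cosmetic differences do not defeat the comparison:

```typescript
// Assumed shape of the normalizedText helper (not shown in the excerpts):
// lowercase, collapse runs of whitespace, and trim.
function normalizedText(input: string): string {
  return input.toLowerCase().replace(/\s+/g, ' ').trim();
}

const h1 = '  Getting   Started ';
const title = 'Getting Started';
console.log(normalizedText(h1) === normalizedText(title)); // true
```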

Duplicate Detection

Crawlith detects duplicate titles and meta descriptions across your site:
// From seo.ts:72-99
export function applyDuplicateStatuses<T extends { value: string | null; status: string }>(items: T[]): T[] {
  const counts = new Map<string, number>();

  // First pass: count occurrences
  for (const item of items) {
    if (item.value) {
      const normalized = normalizedText(item.value);
      if (normalized) {
        counts.set(normalized, (counts.get(normalized) || 0) + 1);
      }
    }
  }

  // Second pass: apply duplicate status
  return items.map(item => {
    if (item.value) {
      const normalized = normalizedText(item.value);
      if ((counts.get(normalized) || 0) > 1) {
        return { ...item, status: 'duplicate' };
      }
    }
    return item;
  });
}
Duplicate detection uses case-insensitive comparison to catch variations like “Home Page” vs. “home page”.
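The two-pass approach above can be exercised with sample data. The sketch below is a compact, self-contained stand-in for `applyDuplicateStatuses` (names and inputs are illustrative):

```typescript
// Illustrative stand-in for the two-pass, case-insensitive duplicate check above.
interface Field { value: string | null; status: string; }

function markDuplicates(items: Field[]): Field[] {
  // First pass: count normalized values
  const counts = new Map<string, number>();
  for (const item of items) {
    const key = item.value?.toLowerCase().trim();
    if (key) counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // Second pass: flag any value seen more than once
  return items.map((item) => {
    const key = item.value?.toLowerCase().trim();
    return key && (counts.get(key) ?? 0) > 1 ? { ...item, status: 'duplicate' } : item;
  });
}

const titles: Field[] = [
  { value: 'Home Page', status: 'ok' },
  { value: 'home page', status: 'ok' }, // differs only in case
  { value: 'About Us', status: 'ok' },
];
console.log(markDuplicates(titles).map((t) => t.status));
// ['duplicate', 'duplicate', 'ok']
```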

Structured Data Detection

Crawlith analyzes JSON-LD structured data for Schema.org markup:
// From structuredData.ts:9-41
export function analyzeStructuredData($: CheerioAPI | string): StructuredDataResult {
  // Accept either raw HTML or a parsed Cheerio instance
  const cheerioObj = typeof $ === 'string' ? load($) : $;
  const scripts = cheerioObj('script[type="application/ld+json"]').toArray();
  if (scripts.length === 0) {
    return { present: false, types: [], valid: false };
  }

  const types = new Set<string>();
  let valid = true;

  for (const script of scripts) {
    const raw = cheerioObj(script).text().trim();
    if (!raw) {
      valid = false;
      continue;
    }

    try {
      const parsed = JSON.parse(raw);
      extractTypes(parsed, types);
    } catch {
      valid = false;
    }
  }

  return {
    present: true,
    valid,
    types: Array.from(types)
  };
}

Extracted Schema Types

Crawlith extracts @type values from JSON-LD, including:
  • Article, BlogPosting, NewsArticle
  • Product, Offer
  • Organization, LocalBusiness
  • BreadcrumbList, WebPage
  • FAQPage, HowTo, Recipe
// From structuredData.ts:43-64
function extractTypes(input: unknown, types: Set<string>): void {
  if (Array.isArray(input)) {
    input.forEach((item) => extractTypes(item, types));
    return;
  }

  if (!input || typeof input !== 'object') return;

  const maybeType = (input as Record<string, unknown>)['@type'];
  if (typeof maybeType === 'string') {
    types.add(maybeType);
  }

  // Handle @graph arrays
  const graph = (input as Record<string, unknown>)['@graph'];
  if (Array.isArray(graph)) {
    graph.forEach((item) => extractTypes(item, types));
  }
}
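A @graph-wrapped payload yields its top-level types. The self-contained sketch below copies the recursion with illustrative names; note that, mirroring the walker above, @type values nested under ordinary properties are not visited (only arrays and @graph are recursed into):

```typescript
// Illustrative copy of the @type-walking recursion shown above.
function collectTypes(input: unknown, types: Set<string>): void {
  if (Array.isArray(input)) {
    input.forEach((item) => collectTypes(item, types));
    return;
  }
  if (!input || typeof input !== 'object') return;

  const obj = input as Record<string, unknown>;
  if (typeof obj['@type'] === 'string') types.add(obj['@type']);

  // Only @graph arrays are recursed into; other nested objects are skipped.
  const graph = obj['@graph'];
  if (Array.isArray(graph)) graph.forEach((item) => collectTypes(item, types));
}

const jsonLd = `{
  "@context": "https://schema.org",
  "@graph": [
    { "@type": "Organization", "name": "Example Co" },
    { "@type": "WebPage", "breadcrumb": { "@type": "BreadcrumbList" } }
  ]
}`;

const types = new Set<string>();
collectTypes(JSON.parse(jsonLd), types);
console.log(Array.from(types)); // ['Organization', 'WebPage']
```

The nested `BreadcrumbList` under the `breadcrumb` property is not collected, since the walker does not descend into arbitrary object values.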

Thin Content Detection

Crawlith scores pages for “thin content” based on word count, text-to-HTML ratio, and uniqueness:
// From content.ts:55-69
export function calculateThinContentScore(
  content: ContentAnalysis,
  duplicationScore: number,
  weights: ThinScoreWeights = DEFAULT_WEIGHTS
): number {
  const wordScore = content.wordCount >= 300 ? 0 : 100 - Math.min(100, (content.wordCount / 300) * 100);
  const textRatioScore = content.textHtmlRatio >= 0.2 ? 0 : 100 - Math.min(100, (content.textHtmlRatio / 0.2) * 100);

  const raw =
    weights.lowWordWeight * wordScore +
    weights.ratioWeight * textRatioScore +
    weights.dupWeight * duplicationScore;

  return Math.max(0, Math.min(100, Number(raw.toFixed(2))));
}
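As a worked example with the default weights: a 150-word page with a 0.10 text/HTML ratio and a duplication score of 20 (all values chosen for illustration) scores 42.5:

```typescript
// Worked example of the scoring formula above (inputs are illustrative).
const wordCount = 150;
const textHtmlRatio = 0.10;
const duplicationScore = 20;

const wordScore = wordCount >= 300 ? 0 : 100 - Math.min(100, (wordCount / 300) * 100);          // 50
const ratioScore = textHtmlRatio >= 0.2 ? 0 : 100 - Math.min(100, (textHtmlRatio / 0.2) * 100); // 50
const raw = 0.4 * wordScore + 0.35 * ratioScore + 0.25 * duplicationScore;                      // 42.5

console.log(raw.toFixed(2)); // "42.50"
```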

Content Analysis Metrics

// From content.ts:3-7
export interface ContentAnalysis {
  wordCount: number;
  textHtmlRatio: number;
  uniqueSentenceCount: number;
}
  • wordCount: Number of words after removing scripts, styles, nav, and footer
  • textHtmlRatio: Ratio of visible text to total HTML size
  • uniqueSentenceCount: Number of unique sentences (deduplication check)

Scoring Weights

const DEFAULT_WEIGHTS: ThinScoreWeights = {
  lowWordWeight: 0.4,      // 40% weight on word count
  ratioWeight: 0.35,       // 35% weight on text/HTML ratio
  dupWeight: 0.25          // 25% weight on duplication
};
Score interpretation:
  • 0-25: High-quality, substantive content
  • 25-50: Moderate content, may need expansion
  • 50-75: Thin content, likely needs improvement
  • 75-100: Very thin content, critical issue
Pages with thin content (high scores) are at risk of:
  • Lower search rankings
  • Being excluded from search indexes
  • Poor user engagement and high bounce rates
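The interpretation buckets above can be expressed as a small helper (illustrative only, not part of Crawlith):

```typescript
// Hypothetical helper mapping a thin-content score to the buckets above.
function interpretThinScore(score: number): string {
  if (score < 25) return 'substantive';
  if (score < 50) return 'moderate';
  if (score < 75) return 'thin';
  return 'very thin';
}

console.log(interpretThinScore(15.3)); // 'substantive'
console.log(interpretThinScore(82.1)); // 'very thin'
```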

CLI Usage

Run SEO Analysis

# Full crawl with SEO analysis (enabled by default)
crawlith crawl https://example.com
SEO analysis runs automatically during crawling and includes:
  • Title and meta description validation
  • H1 tag analysis
  • Structured data detection
  • Thin content scoring
  • Duplicate detection across all pages

Export SEO Data

# Export to JSON for detailed analysis
crawlith crawl https://example.com --export json

# Export to CSV for spreadsheet analysis
crawlith crawl https://example.com --export csv
The JSON export includes per-page SEO metrics:
{
  "nodes": [
    {
      "url": "https://example.com/page",
      "title": "Page Title",
      "titleLength": 55,
      "titleStatus": "ok",
      "metaDescription": "Description...",
      "h1Count": 1,
      "h1Status": "ok",
      "structuredDataTypes": ["Article", "BreadcrumbList"],
      "wordCount": 850,
      "thinContentScore": 15.3
    }
  ]
}
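The export is easy to post-process. The sketch below assumes the node shape shown in the sample above (field names taken from the sample; the inline data stands in for a parsed export file):

```typescript
// Filter exported nodes for likely problem pages. The node shape mirrors the
// sample export above; the inline array stands in for JSON.parse of the export.
interface SeoNode {
  url: string;
  titleStatus: string;
  h1Status: string;
  thinContentScore: number;
}

const nodes: SeoNode[] = [
  { url: 'https://example.com/page', titleStatus: 'ok', h1Status: 'ok', thinContentScore: 15.3 },
  { url: 'https://example.com/thin', titleStatus: 'duplicate', h1Status: 'critical', thinContentScore: 82.1 },
];

const problems = nodes.filter(
  (n) => n.titleStatus !== 'ok' || n.h1Status !== 'ok' || n.thinContentScore >= 50
);
console.log(problems.map((n) => n.url)); // ['https://example.com/thin']
```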

View SEO Summary

# View high-level insights in terminal
crawlith crawl https://example.com
The terminal output includes:
  • Pages with missing or duplicate titles
  • Pages with missing or duplicate meta descriptions
  • Pages with H1 issues (missing or multiple)
  • Pages with thin content (high scores)
  • Pages with structured data

Best Practices

  • Ensure each page has a unique, descriptive title between 50 and 60 characters. Avoid using the same title across multiple pages.
  • Craft unique meta descriptions (140-160 characters) that accurately summarize the page and encourage clicks from search results.
  • Each page should have exactly one H1 tag that clearly describes the page topic. Multiple H1s dilute topical focus.
  • Implement Schema.org markup (JSON-LD) for rich snippets. Common types include Article, Product, LocalBusiness, and BreadcrumbList.
  • Aim for at least 300 words of unique, valuable content per page. Pages with thin content (< 300 words) may struggle to rank.

Common Issues and Fixes

Duplicate Titles

Problem: Multiple pages share the same title tag
Impact: Search engines can’t differentiate pages, potential ranking penalties
Fix: Create unique titles that accurately describe each page’s content

Missing Meta Descriptions

Problem: Pages lack <meta name="description"> tags
Impact: Search engines generate snippets from page content (may not be optimal)
Fix: Write custom meta descriptions for important pages

Multiple H1 Tags

Problem: Page contains more than one H1 element
Impact: Dilutes topical focus, confuses search engines about page hierarchy
Fix: Use only one H1 for the main page heading, use H2-H6 for subheadings

Thin Content

Problem: Pages with very few words or low text-to-HTML ratio
Impact: Perceived as low-quality by search engines, poor user experience
Fix: Expand content to at least 300 words, ensure substantive value

See Also

Graph Analysis

Identify structural SEO issues like orphan pages and poor internal linking

Content Clustering

Detect keyword cannibalization and content overlap issues

Export Data

Export SEO analysis results for reporting and tracking