
Overview

Crawlith supports multiple export formats to integrate with your workflow, whether you need structured data for analysis, reports for stakeholders, or interactive visualizations for exploration.

Export Formats

JSON

Export complete graph data with all metrics and metadata:
crawlith crawl https://example.com --export json
Produces graph.json with the complete graph structure:
{
  "nodes": [
    {
      "url": "https://example.com/",
      "depth": 0,
      "status": 200,
      "inLinks": 5,
      "outLinks": 12,
      "pageRank": 0.0234,
      "pageRankScore": 87.3,
      "canonical": "https://example.com/",
      "contentHash": "sha256:abc123...",
      "simhash": "18446744073709551615",
      "wordCount": 850,
      "thinContentScore": 15.2,
      "duplicateClusterId": null,
      "clusterId": null,
      "brokenLinks": []
    }
  ],
  "edges": [
    {
      "source": "https://example.com/",
      "target": "https://example.com/about",
      "weight": 1.0
    }
  ],
  "duplicateClusters": [],
  "contentClusters": [],
  "limitReached": false,
  "sessionStats": {
    "pagesFetched": 247,
    "pagesCached": 0,
    "pagesSkipped": 3,
    "totalFound": 250
  }
}
Use cases:
  • Import into data analysis tools (Python, R)
  • Feed into custom dashboards
  • Store for historical comparison
  • Process with jq or other JSON tools
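As a quick sanity check, the exported structure can be summarized in a few lines of Python. A sketch against the schema shown above; the inline dict stands in for a parsed graph.json:

```python
import json
from collections import Counter

# Minimal stand-in for a parsed graph.json (same schema as shown above);
# in practice: graph = json.load(open("graph.json"))
graph = {
    "nodes": [
        {"url": "https://example.com/", "depth": 0, "status": 200, "inLinks": 5},
        {"url": "https://example.com/gone", "depth": 2, "status": 404, "inLinks": 0},
    ],
    "sessionStats": {"pagesFetched": 2, "totalFound": 2},
}

# Count pages per HTTP status
status_counts = Counter(n["status"] for n in graph["nodes"])

# Orphans: discovered beyond the root but with no inbound links
orphans = [n["url"] for n in graph["nodes"] if n["depth"] > 0 and n["inLinks"] == 0]
```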

CSV

Export graph data as spreadsheet-friendly CSV files:
crawlith crawl https://example.com --export csv
Produces two files. nodes.csv:
URL,Depth,Status,InboundLinks,OutboundLinks,PageRankScore
https://example.com/,0,200,5,12,87.300
https://example.com/about,1,200,3,8,72.450
edges.csv:
Source,Target,Weight
https://example.com/,https://example.com/about,1.0
https://example.com/,https://example.com/contact,1.0
From the code:
// From crawlExport.ts:1-16
export function renderCrawlCsvNodes(graphData: any): string {
  const nodeHeaders = ['URL', 'Depth', 'Status', 'InboundLinks', 'OutboundLinks', 'PageRankScore'];
  const nodeRows = graphData.nodes.map((n: any) => {
    const outbound = graphData.edges.filter((e: any) => e.source === n.url).length;
    const inbound = graphData.edges.filter((e: any) => e.target === n.url).length;
    const statusStr = n.status === 0 ? 'Pending/Limit' : n.status;
    return [n.url, n.depth, statusStr, inbound, outbound, (n.pageRankScore || 0).toFixed(3)].join(',');
  });
  return [nodeHeaders.join(','), ...nodeRows].join('\n');
}

export function renderCrawlCsvEdges(graphData: any): string {
  const edgeHeaders = ['Source', 'Target', 'Weight'];
  const edgeRows = graphData.edges.map((e: any) => [e.source, e.target, e.weight].join(','));
  return [edgeHeaders.join(','), ...edgeRows].join('\n');
}
Use cases:
  • Import into Excel or Google Sheets
  • Create pivot tables and charts
  • Share with non-technical stakeholders
  • Quick data exploration
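The CSV output parses cleanly with standard tooling. A sketch, with the sample rows from above inlined as a string in place of a real nodes.csv file:

```python
import csv
import io

# Sample nodes.csv content (as produced above); in practice: open("nodes.csv")
nodes_csv = """URL,Depth,Status,InboundLinks,OutboundLinks,PageRankScore
https://example.com/,0,200,5,12,87.300
https://example.com/about,1,200,3,8,72.450
"""

rows = list(csv.DictReader(io.StringIO(nodes_csv)))

# Columns arrive as strings; cast the numeric ones as needed
top = max(rows, key=lambda r: float(r["PageRankScore"]))
```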

Markdown

Generate human-readable Markdown reports:
crawlith crawl https://example.com --export markdown
Produces summary.md:
# Crawlith Crawl Summary - https://example.com

## 📊 Metrics
- Total Pages Discovered: 250
- Session Pages Crawled: 247
- Total Edges: 1,234
- Avg Depth: 2.34
- Max Depth: 5
- Crawl Efficiency: 92.3%

## 📄 Top Pages (by In-degree)
| URL | Inbound | Status |
| :--- | :--- | :--- |
| https://example.com/ | 45 | 200 |
| https://example.com/products | 23 | 200 |
| https://example.com/about | 18 | 200 |

## 🏆 Top PageRank Pages
| URL | Score |
| :--- | :--- |
| https://example.com/ | 87.300/100 |
| https://example.com/products | 72.450/100 |
From the code:
// From crawlExport.ts:18-58
export function renderCrawlMarkdown(url: string, graphData: any, metrics: any, graph: any): string {
  const md = [
    `# Crawlith Crawl Summary - ${url}`,
    '',
    `## 📊 Metrics`,
    `- Total Pages Discovered: ${metrics.totalPages}`,
    `- Session Pages Crawled: ${graph.sessionStats?.pagesFetched ?? 0}`,
    `- Total Edges: ${metrics.totalEdges}`,
    `- Avg Depth: ${metrics.averageDepth.toFixed(2)}`,
    `- Max Depth: ${metrics.maxDepthFound}`,
    `- Crawl Efficiency: ${(metrics.crawlEfficiencyScore * 100).toFixed(1)}%`,
  ];
  // ...
}
Use cases:
  • Include in GitHub repositories (commit as documentation)
  • Convert to PDF for client reports
  • Embed in Notion, Confluence, or other wikis
  • Quick overview without opening specialized tools

HTML

Generate standalone HTML reports with embedded data:
crawlith crawl https://example.com --export html
Produces report.html with:
  • Complete graph visualization (interactive)
  • Metrics dashboard
  • Filterable tables
  • No external dependencies (works offline)
From the code:
// From html.ts:8-27
export function generateHtml(graphData: any, metrics: Metrics): string {
  // Strip heavy HTML content from nodes to keep the report lightweight
  const vizGraphData = {
    ...graphData,
    nodes: graphData.nodes ? graphData.nodes.map((n: any) => {
      const { html, ...rest } = n;
      return rest;
    }) : []
  };

  const graphJson = safeJson(vizGraphData);
  const metricsJson = safeJson(metrics);

  return Crawl_HTML.replace('</body>', `<script>
    window.GRAPH_DATA = ${graphJson};
    window.METRICS_DATA = ${metricsJson};
  </script>
  </body>`);
}
Features:
  • Self-contained (no server required)
  • Interactive graph visualization
  • Filter nodes by status, depth, or PageRank
  • Click nodes to see details
  • Export subgraphs
Use cases:
  • Share reports via email
  • Host on internal wikis or intranets
  • Archive crawl results
  • Present findings in meetings
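Because the data is embedded as a plain window.GRAPH_DATA assignment, it can be recovered from a saved report without re-crawling. A sketch; the HTML string below mimics the injected script shown above rather than a real report.html:

```python
import json
import re

# Hypothetical minimal report body, mirroring the script injection shown above
graph = {"nodes": [{"url": "https://example.com/", "status": 200}], "edges": []}
html = (
    "<html><body><script>\n"
    f"window.GRAPH_DATA = {json.dumps(graph)};\n"
    "window.METRICS_DATA = {};\n"
    "</script></body></html>"
)

# Recover the embedded graph data from the saved report
match = re.search(r"window\.GRAPH_DATA\s*=\s*(.+?);\s*$", html, re.MULTILINE)
recovered = json.loads(match.group(1))
```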

Visualize

Export formats optimized for visualization tools:
crawlith crawl https://example.com --export visualize
Produces multiple formats:
  • Graphviz DOT: For rendering with Graphviz
  • GEXF: For Gephi network analysis
  • D3.js JSON: For custom D3 visualizations
Use cases:
  • Create custom network diagrams
  • Perform advanced graph analysis in Gephi
  • Build interactive visualizations
  • Generate site architecture diagrams
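If all you need is a basic diagram, the edge list from the JSON export maps directly onto DOT. A minimal sketch, assuming the source/target edge schema shown earlier:

```python
# Build a Graphviz DOT digraph from exported edges (source/target schema as above)
edges = [
    {"source": "https://example.com/", "target": "https://example.com/about"},
    {"source": "https://example.com/", "target": "https://example.com/contact"},
]

lines = ["digraph site {"]
for e in edges:
    lines.append(f'  "{e["source"]}" -> "{e["target"]}";')
lines.append("}")
dot = "\n".join(lines)
# Render with: dot -Tsvg site.dot -o site.svg
```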

CLI Usage

Export During Crawl

# Single format
crawlith crawl https://example.com --export json

# Multiple formats (comma-separated)
crawlith crawl https://example.com --export json,csv,markdown,html

# All formats
crawlith crawl https://example.com --export json,csv,markdown,html,visualize

Export from Existing Snapshot

# Export latest completed snapshot
crawlith export https://example.com --export json,html

# Custom output directory
crawlith export https://example.com --export json --output ./reports
From the code:
// From export.ts:13-24
export const exportCmd = new Command('export')
  .description('Export latest snapshot data for a site')
  .argument('[url]', 'URL or domain of the site')
  .option('-o, --output <path>', 'Output directory', './crawlith-reports')
  .option('--export [formats]', 'Export formats (comma-separated)', 'json')
  .action(async (url, options) => {
    // Load snapshot from database
    const graph = loadGraphFromSnapshot(snapshot.id);
    const metrics = calculateMetrics(graph, maxDepth);
    
    // Export to specified formats
    await runCrawlExports(exportFormats, outputDir, url, graph.toJSON(), metrics, graph);
  });

Output Location

By default, exports are saved to:
./crawlith-reports/{domain}/
  ├── graph.json
  ├── nodes.csv
  ├── edges.csv
  ├── summary.md
  ├── report.html
  └── graph.dot
Customize the output directory:
crawlith crawl https://example.com --export json --output /path/to/reports

Export Filtering

Export data is automatically filtered to include only relevant information:
// HTML exports strip page HTML to reduce file size
const vizGraphData = {
  ...graphData,
  nodes: graphData.nodes.map((n: any) => {
    const { html, ...rest } = n;  // Remove HTML content
    return rest;
  })
};
Why this matters: Full HTML for each page can make export files very large (100MB+). Stripped exports focus on metrics and structure.
If you need the full HTML content, use the JSON export and access the database directly, or query specific pages using the crawlith page command.

Advanced Use Cases

Automated Reporting

Combine exports with CI/CD:
#!/bin/bash
# Daily crawl and report generation
crawlith crawl https://example.com --export json,html,markdown

# Upload to S3 for team access
aws s3 sync ./crawlith-reports/ s3://my-bucket/crawl-reports/$(date +%Y-%m-%d)/

# Send Markdown report via Slack
curl -X POST -H 'Content-type: application/json' \
  --data "{\"text\": \"$(cat ./crawlith-reports/example.com/summary.md)\"}" \
  $SLACK_WEBHOOK_URL

Diff Analysis

Compare exports over time:
# Crawl and export weekly
crawlith crawl https://example.com --export json --output ./reports/week1
crawlith crawl https://example.com --export json --output ./reports/week2

# Compare graphs
crawlith crawl --compare ./reports/week1/graph.json ./reports/week2/graph.json
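The same comparison can be done by hand on the JSON exports with a set difference over URLs. A sketch; the inline dicts stand in for the two graph.json files:

```python
# Compare two exported graphs by URL set (node schema as above);
# in practice, load each from ./reports/weekN/graph.json with json.load
week1 = {"nodes": [{"url": "https://example.com/"}, {"url": "https://example.com/old"}]}
week2 = {"nodes": [{"url": "https://example.com/"}, {"url": "https://example.com/new"}]}

urls1 = {n["url"] for n in week1["nodes"]}
urls2 = {n["url"] for n in week2["nodes"]}
added = urls2 - urls1    # pages that appeared this week
removed = urls1 - urls2  # pages that disappeared
```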

Custom Processing

Process JSON exports with jq:
# Find all 404 pages
jq -r '.nodes[] | select(.status == 404) | .url' graph.json

# List pages with low PageRank
jq -r '.nodes[] | select(.pageRankScore < 20) | "\(.url): \(.pageRankScore)"' graph.json

# Count pages by depth
jq -r '.nodes | group_by(.depth) | .[] | "Depth \(.[0].depth): \(length) pages"' graph.json

# Find orphan pages (depth > 0, inLinks = 0)
jq -r '.nodes[] | select(.depth > 0 and .inLinks == 0) | .url' graph.json

Python Analysis

import json
import pandas as pd

# Load graph data
with open('graph.json') as f:
    data = json.load(f)

# Convert to DataFrame
df = pd.DataFrame(data['nodes'])

# Analyze
print(f"Total pages: {len(df)}")
print(f"Average PageRank: {df['pageRankScore'].mean():.2f}")
print(f"Pages with thin content: {len(df[df['thinContentScore'] > 50])}")

# Find high-value pages (high PageRank, low depth)
high_value = df[(df['pageRankScore'] > 70) & (df['depth'] <= 2)]
print(f"High-value pages: {len(high_value)}")
print(high_value[['url', 'pageRankScore', 'depth']].head())

Export Reference

Supported Formats

| Format | Extension | Size | Use Case |
| :--- | :--- | :--- | :--- |
| JSON | .json | Large | Programmatic analysis, archival |
| CSV | .csv | Medium | Spreadsheets, SQL import |
| Markdown | .md | Small | Documentation, reports |
| HTML | .html | Medium | Interactive viewing, sharing |
| Graphviz | .dot | Small | Network diagrams |

Data Completeness

| Format | Nodes | Edges | Metrics | HTML Content |
| :--- | :--- | :--- | :--- | :--- |
| JSON | ✓ Full | ✓ Full | ✓ Full | ✓ Full |
| CSV | ✓ Summary | ✓ Full | ✗ | ✗ |
| Markdown | ✓ Top 10 | ✗ | ✓ Summary | ✗ |
| HTML | ✓ Full | ✓ Full | ✓ Full | ✗ Stripped |
| Graphviz | ✓ Full | ✓ Full | ✗ | ✗ |

See Also

  • Graph Analysis: understand the metrics included in exports
  • Incremental Crawls: compare exported snapshots over time
  • CLI Reference: complete CLI documentation for export options
