
Understanding Results

The GTM Research Engine returns rich, structured results combining evidence from multiple sources with AI-powered analysis. This guide explains how to interpret confidence scores, evidence quality, and performance metrics.

Result Structure

Every research response follows a consistent structure:
{
  "research_id": "550e8400-e29b-41d4-a716-446655440000",
  "total_companies": 10,
  "search_strategies_generated": 8,
  "total_searches_executed": 80,
  "processing_time_ms": 45230,
  "results": [...],
  "search_performance": {
    "queries_per_second": 12.5,
    "failed_requests": 2
  }
}
Top-level fields provide execution context:
  • research_id: Unique identifier for this research run
  • total_companies: Number of domains analyzed
  • search_strategies_generated: Search strategies created by the LLM
  • total_searches_executed: Total searches performed (companies × strategies)
  • processing_time_ms: End-to-end execution time
  • search_performance: Performance metrics
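Assuming the JSON response has been parsed into a Python dict, these fields can be checked programmatically (the values below are taken from the example above):

```python
# Derive execution context from the top-level response fields.
response = {
    "research_id": "550e8400-e29b-41d4-a716-446655440000",
    "total_companies": 10,
    "search_strategies_generated": 8,
    "total_searches_executed": 80,
    "processing_time_ms": 45230,
    "search_performance": {"queries_per_second": 12.5, "failed_requests": 2},
}

# Sanity check: searches = companies x strategies.
assert response["total_searches_executed"] == (
    response["total_companies"] * response["search_strategies_generated"]
)

# Failure rate as a fraction of all executed searches.
failure_rate = (
    response["search_performance"]["failed_requests"]
    / response["total_searches_executed"]
)
print(f"failure rate: {failure_rate:.1%}")  # prints "failure rate: 2.5%"
```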

Company Results

Each company in the results array contains comprehensive findings:
{
  "domain": "stripe.com",
  "confidence_score": 0.87,
  "evidence_sources": 4,
  "findings": {
    "technologies": [
      "Kubernetes",
      "Docker",
      "Google Cloud Platform",
      "Istio"
    ],
    "evidence": [
      {
        "url": "https://stripe.com/blog/operating-kubernetes-at-stripe",
        "title": "Operating Kubernetes at Stripe",
        "snippet": "We use Kubernetes to orchestrate our microservices across multiple data centers...",
        "source_name": "google_search"
      }
    ],
    "signals_found": 12
  }
}

Core Fields

Company identifier: the domain exactly as provided in the request.
{
  "domain": "stripe.com"
}
Used to correlate results with input data.
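Since results echo back the input domains, a simple index joins findings to your input list; a minimal sketch with hypothetical data:

```python
# Index results by domain to join findings back to the input list.
input_domains = ["stripe.com", "square.com"]
results = [
    {"domain": "stripe.com", "confidence_score": 0.87},
    {"domain": "square.com", "confidence_score": 0.0},
]

by_domain = {r["domain"]: r for r in results}
for domain in input_domains:
    score = by_domain[domain]["confidence_score"]
    print(f"{domain}: {score}")
```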

Confidence Scoring

Confidence scores are generated by LLM analysis of all collected evidence:
# From pipeline.py:183-206
def _build_company_result(
    self, domain: str, evidences: List[Evidence], analysis_result: Dict
) -> CompanyResearchResult:
    # Extract technologies, signals, and confidence from LLM analysis
    technologies = analysis_result.get("technologies", [])
    goal_match_signals = analysis_result.get("goal_match_signals", [])
    confidence_score = analysis_result.get("confidence_score", 0.0)
    
    # Calculate evidence sources count
    evidence_sources = len(set(evidence.source_name for evidence in evidences))
    
    # Build findings with LLM results
    findings = Findings(
        technologies=technologies,
        evidence=evidences,
        signals_found=len(goal_match_signals)
    )
    
    return CompanyResearchResult(
        domain=domain,
        confidence_score=round(confidence_score, 2),
        evidence_sources=evidence_sources,
        findings=findings
    )

Scoring Factors

1. Evidence Quantity

More evidence items generally correlate with higher confidence:
  • 1-3 items: Limited basis for assessment
  • 4-8 items: Good evidence foundation
  • 9+ items: Strong evidence base

2. Source Diversity

Evidence from multiple sources increases confidence:
  • Single source: Possible bias or limited scope
  • 2-3 sources: Good cross-validation
  • 4+ sources: Excellent corroboration

3. Technology Matches

Direct mentions of research goal technologies:
  • Exact technology names (“Kubernetes”)
  • Related technologies (“container orchestration”)
  • Implementation details (“production k8s cluster”)

4. Signal Strength

Quality of goal-match signals:
  • Direct statements (“We use Kubernetes…”)
  • Job requirements (“Kubernetes experience required”)
  • Case studies (“Migrating to Kubernetes”)
  • News announcements (“Kubernetes adoption”)

Confidence Labels

The frontend translates scores to human-readable labels:
// From ResearchResults.tsx:188-194
const getConfidenceLabel = (score: number) => {
  if (score >= 0.9) return "Very High";
  if (score >= 0.8) return "High";
  if (score >= 0.7) return "Good";
  if (score >= 0.6) return "Moderate";
  return "Low";
};
{
  "confidence_score": 0.93,
  "interpretation": "Very strong evidence across multiple sources",
  "characteristics": [
    "Multiple direct technology mentions",
    "Evidence from 4+ sources",
    "Recent and specific findings",
    "Strong goal-match signals"
  ]
}

Confidence Thresholds

The confidence_threshold parameter filters high-confidence results:
{
  "confidence_threshold": 0.7,
  "results": [...],  // All results regardless of confidence
  "high_confidence_results": [...]  // Only results >= 0.7
}
Use high_confidence_results for automated decisions, but review all results to avoid missing borderline matches.
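A minimal client-side sketch of the same split, using hypothetical scores (the API already returns high_confidence_results; this is only for illustration):

```python
# Partition results by a confidence threshold and surface borderline cases.
threshold = 0.7
results = [
    {"domain": "a.com", "confidence_score": 0.93},
    {"domain": "b.com", "confidence_score": 0.68},
    {"domain": "c.com", "confidence_score": 0.71},
]

high_confidence = [r for r in results if r["confidence_score"] >= threshold]
# Results just under the threshold are worth a manual look.
borderline = [
    r for r in results
    if threshold - 0.1 <= r["confidence_score"] < threshold
]

print([r["domain"] for r in high_confidence])  # prints ['a.com', 'c.com']
print([r["domain"] for r in borderline])       # prints ['b.com']
```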

Technologies Array

Technologies are identified automatically by the LLM from the collected evidence:
{
  "technologies": [
    "Kubernetes",
    "Docker",
    "Google Cloud Platform",
    "Istio",
    "Terraform",
    "Python"
  ]
}
Extraction Process:
# From pipeline.py:186-189
technologies = analysis_result.get("technologies", [])
goal_match_signals = analysis_result.get("goal_match_signals", [])
confidence_score = analysis_result.get("confidence_score", 0.0)
The LLM analyzes all evidence and identifies:
  • Primary technologies: Directly related to research goal
  • Supporting technologies: Commonly used with primary tech
  • Infrastructure: Platforms, cloud providers, tools
  • Languages: Programming languages mentioned
Technologies are deduplicated and normalized (e.g., “K8s” → “Kubernetes”).
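The normalization itself happens inside the LLM analysis; a rough client-side sketch of the same idea, with a hypothetical alias map:

```python
# Hypothetical alias map; the actual normalization is done by the LLM.
ALIASES = {"k8s": "Kubernetes", "gcp": "Google Cloud Platform"}

def normalize_technologies(raw: list[str]) -> list[str]:
    """Map known aliases to canonical names and deduplicate, preserving order."""
    seen, result = set(), []
    for tech in raw:
        canonical = ALIASES.get(tech.lower(), tech)
        if canonical.lower() not in seen:
            seen.add(canonical.lower())
            result.append(canonical)
    return result

print(normalize_technologies(["K8s", "Kubernetes", "Docker", "GCP"]))
# prints ['Kubernetes', 'Docker', 'Google Cloud Platform']
```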

Technology Relevance

Not all technologies carry equal weight:
High relevance: technologies directly matching the research goal “Companies using Kubernetes”:
  • Kubernetes
  • K8s
  • Container orchestration
  • kubectl
These directly indicate a goal match.

Medium relevance: supporting or related technologies for the same goal:
  • Docker
  • Helm
  • Istio
  • Prometheus
Often used with Kubernetes; these strengthen the case.

Low relevance: tangentially related or common technologies:
  • Python
  • React
  • PostgreSQL
  • AWS
Common technologies that don’t indicate Kubernetes usage.
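These tiers could be applied client-side; a sketch with hypothetical tier sets for the same Kubernetes goal (the engine does not expose this classification):

```python
# Hypothetical tier sets for a "Companies using Kubernetes" goal.
HIGH = {"kubernetes", "k8s", "container orchestration", "kubectl"}
MEDIUM = {"docker", "helm", "istio", "prometheus"}

def relevance(tech: str) -> str:
    """Classify a technology's relevance to the research goal."""
    name = tech.lower()
    if name in HIGH:
        return "high"
    if name in MEDIUM:
        return "medium"
    return "low"

print([relevance(t) for t in ["Kubernetes", "Helm", "Python"]])
# prints ['high', 'medium', 'low']
```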

Evidence Structure

Evidence items are the foundation of all findings:
{
  "url": "https://stripe.com/blog/operating-kubernetes-at-stripe",
  "title": "Operating Kubernetes at Stripe",
  "snippet": "We use Kubernetes to orchestrate our microservices across multiple data centers, managing thousands of containers with automated rollouts and rollbacks.",
  "source_name": "google_search"
}

Evidence Fields

Source URL: direct link to the evidence source. Examples:
  • Blog posts: https://company.com/blog/kubernetes-migration
  • Job postings: https://company.com/careers/devops-engineer
  • News: https://techcrunch.com/2024/company-kubernetes
  • Documentation: https://company.com/docs/infrastructure.pdf
Always verify evidence by clicking through to source URLs.

Evidence Quality

Not all evidence is equally valuable:
{
  "url": "https://company.com/blog/kubernetes-production",
  "title": "Running Kubernetes in Production: Lessons Learned",
  "snippet": "After 2 years running Kubernetes in production with 1000+ pods across 50 microservices, we've learned...",
  "source_name": "google_search",
  "quality_indicators": [
    "Direct first-person statement",
    "Specific quantitative details",
    "Production usage confirmed",
    "Recent and detailed"
  ]
}

Evidence Grouping

The frontend groups evidence by source for easier review:
// From ResearchResults.tsx:196-203
const evidenceBySource = company.findings.evidence.reduce((acc, evidence) => {
  if (!acc[evidence.source_name]) {
    acc[evidence.source_name] = [];
  }
  acc[evidence.source_name].push(evidence);
  return acc;
}, {} as Record<string, typeof company.findings.evidence>);
Results display in tabs:
  • google_search (12 items)
  • jobs_search (5 items)
  • news_search (3 items)
Grouping helps identify which sources provided strongest evidence and detect potential gaps.
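The same grouping is straightforward in scripts; a Python equivalent of the frontend reduce, with hypothetical evidence items:

```python
from collections import defaultdict

# Group evidence items by source_name, mirroring the frontend tabs.
evidence = [
    {"url": "https://stripe.com/blog/k8s", "source_name": "google_search"},
    {"url": "https://stripe.com/careers/devops", "source_name": "jobs_search"},
    {"url": "https://stripe.com/blog/infra", "source_name": "google_search"},
]

by_source = defaultdict(list)
for item in evidence:
    by_source[item["source_name"]].append(item)

print({source: len(items) for source, items in by_source.items()})
# prints {'google_search': 2, 'jobs_search': 1}
```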

Signals Found

Signal count indicates goal-match strength:
{
  "signals_found": 12
}
Calculated from LLM analysis:
# From pipeline.py:198
signals_found=len(goal_match_signals)

Signal Types

Signals are discrete indicators of goal match:
Direct statements: the company explicitly states technology use. Examples:
  • “We use Kubernetes for container orchestration”
  • “Our infrastructure runs on GKE”
  • “Migrated to Kubernetes in 2024”
Weight: highest confidence

Signal Interpretation

  • 0-2 signals: Minimal evidence, likely weak or no match
  • 3-5 signals: Some evidence, possible match but limited
  • 6-9 signals: Good evidence across multiple types
  • 10-15 signals: Strong evidence with diverse signal types
  • 16+ signals: Very strong evidence, high confidence match
Signal count alone doesn’t determine confidence; quality matters more than quantity.
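The bands above can be expressed as a small helper (hypothetical; not part of the API):

```python
def interpret_signals(count: int) -> str:
    """Map a signal count to an interpretation band."""
    if count >= 16:
        return "Very strong evidence, high confidence match"
    if count >= 10:
        return "Strong evidence with diverse signal types"
    if count >= 6:
        return "Good evidence across multiple types"
    if count >= 3:
        return "Some evidence, possible match but limited"
    return "Minimal evidence, likely weak or no match"

print(interpret_signals(12))
# prints "Strong evidence with diverse signal types"
```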

Performance Metrics

Every response includes performance data:
{
  "processing_time_ms": 45230,
  "search_performance": {
    "queries_per_second": 12.5,
    "failed_requests": 2
  }
}

Metrics Breakdown

Total execution time in milliseconds. Includes:
  • Strategy generation (~2-5s)
  • Evidence collection (varies by depth)
  • LLM analysis (~1-3s per company)
Expected ranges:
  • Quick depth: 15,000-30,000ms (15-30s)
  • Standard depth: 30,000-60,000ms (30-60s)
  • Comprehensive: 60,000-120,000ms (1-2min)
Higher max_parallel_searches reduces processing time but may increase errors.
Search execution throughput. Calculated from:
# From response.py (SearchPerformance)
queries_per_second = total_searches / processing_time_seconds
Typical values:
  • 8-12 QPS: Good performance
  • 13-20 QPS: Excellent performance
  • Less than 8 QPS: Possible bottlenecks
Factors affecting QPS:
  • max_parallel_searches setting
  • Source API response times
  • Network latency
  • Rate limiting
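The formula with illustrative numbers (not the example response above):

```python
# Reproduce the queries_per_second formula from millisecond timing.
total_searches = 100
processing_time_ms = 8000

qps = total_searches / (processing_time_ms / 1000)
print(qps)  # prints 12.5
```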
Count of failed searches. Incremented when:
  • Source API errors
  • Network timeouts
  • Circuit breaker trips
  • Invalid responses
Acceptable levels:
  • 0-2%: Normal (transient errors)
  • 3-5%: Monitor (possible issues)
  • >5%: Investigate (systematic problems)
Example:
{
  "total_searches_executed": 80,
  "failed_requests": 2,
  "failure_rate": 0.025  // 2.5%
}
High failure rates indicate configuration issues or source API problems.
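A small helper applying the acceptable levels above (hypothetical; not part of the API):

```python
def failure_rate_status(failed: int, total: int) -> str:
    """Classify a failure rate against the acceptable levels."""
    rate = failed / total
    if rate > 0.05:
        return "investigate"
    if rate >= 0.03:
        return "monitor"
    return "normal"

print(failure_rate_status(2, 80))  # 2.5%, prints "normal"
```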

Empty Results

Companies without evidence receive empty results:
# From pipeline.py:208-221
def _build_empty_result(self, domain: str) -> CompanyResearchResult:
    findings = Findings(
        technologies=[],
        evidence=[],
        signals_found=0
    )
    
    return CompanyResearchResult(
        domain=domain,
        confidence_score=0.0,
        evidence_sources=0,
        findings=findings
    )
Result structure:
{
  "domain": "example.com",
  "confidence_score": 0.0,
  "evidence_sources": 0,
  "findings": {
    "technologies": [],
    "evidence": [],
    "signals_found": 0
  }
}
Common reasons:

1. No Public Information

Company has limited online presence or documentation.

2. Incorrect Domain

Domain doesn’t match actual company website.

3. Technology Not Used

Company genuinely doesn’t use the researched technology.

4. Search Strategies Missed Evidence

Evidence exists but strategies didn’t find it (try comprehensive depth).
Empty results are still valuable: they indicate companies that likely don’t match your criteria.

Frontend Result Display

The web interface presents results with visual hierarchy:
// From ResearchResults.tsx:212-252
<Card elevation={2}>
  {/* Company Header */}
  <Typography variant="h6">{company.domain}</Typography>
  
  {/* Confidence Chips */}
  <Chip label={`${(company.confidence_score * 100).toFixed(0)}% Confidence`} />
  <Chip label={getConfidenceLabel(company.confidence_score)} />
  <Chip label={`${company.evidence_sources} Sources`} />
  
  {/* Technologies */}
  {company.findings.technologies.map((tech) => (
    <Chip label={tech} variant="outlined" />
  ))}
  
  {/* Evidence Tabs */}
  <Tabs>
    {sourceNames.map((sourceName) => (
      <Tab label={sourceName} />
    ))}
  </Tabs>
  
  {/* Summary */}
  <Typography>
    {company.findings.signals_found} signals detected
    across {company.evidence_sources} data sources
  </Typography>
</Card>

Visual Indicators

// From ResearchResults.tsx:182-186
const getConfidenceColor = (score: number) => {
  if (score >= 0.8) return "success";  // Green
  if (score >= 0.6) return "warning";  // Orange
  return "error";  // Red
};
Visual feedback:
  • 🟢 Green: High confidence (≥0.8)
  • 🟠 Orange: Moderate (0.6-0.8)
  • 🔴 Red: Low (less than 0.6)

Streaming Results

Streaming mode provides incremental results:

Event Types

The stream begins with a pipeline_start event announcing the run:
{
  "type": "pipeline_start",
  "message": "Starting evidence collection",
  "domains": ["stripe.com", "square.com"],
  "total_strategies": 8,
  "timestamp": 1709856234.567
}
Process domain_analyzed events to display results incrementally as they complete, improving perceived performance.
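A minimal sketch of consuming the stream; the domain_analyzed payload shape shown here is an assumption:

```python
# Dispatch streamed events as they arrive; results accumulate incrementally.
def handle_event(event: dict, results: list) -> None:
    if event["type"] == "pipeline_start":
        print(f"Researching {len(event['domains'])} domains...")
    elif event["type"] == "domain_analyzed":
        # Append each company result as soon as it completes.
        results.append(event["result"])

results: list = []
handle_event(
    {"type": "pipeline_start", "domains": ["stripe.com", "square.com"]}, results
)
handle_event(
    {"type": "domain_analyzed",
     "result": {"domain": "stripe.com", "confidence_score": 0.87}},
    results,
)
print(len(results))  # prints 1
```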

Best Practices

Use case-specific thresholds:
  • Automated filtering: 0.8-0.9 (high precision)
  • Human review: 0.6-0.7 (balanced recall/precision)
  • Exploratory research: 0.5-0.6 (high recall)
  • All results: 0.0 (no filtering)
Always review borderline cases manually.
Click through to source URLs:
  • Verify evidence is current
  • Check context around snippets
  • Assess source reliability
  • Look for implementation details
Don’t rely solely on confidence scores.
Higher evidence_sources = stronger findings:
  • Single source: Verify independently
  • 2-3 sources: Good corroboration
  • 4+ sources: Strong validation
Multiple sources reduce false positives.
Not all technologies are equally relevant:
  • Focus on research goal matches
  • Consider supporting technologies
  • Ignore common/generic tech
Review actual evidence for technology context.
Track metrics over time:
  • Processing time trends
  • Failure rate patterns
  • QPS variations
Identify and resolve performance issues early.

Troubleshooting

Few or no matches found

Possible causes:
  • Research goal too specific
  • Technologies not widely publicized
  • Wrong company domains
Solutions:
  • Broaden research criteria
  • Try comprehensive search depth
  • Verify domain accuracy
  • Review evidence manually
Irrelevant or incorrect matches

Possible causes:
  • Research goal ambiguity
  • LLM misinterpreting goal
  • Related but different technologies
Solutions:
  • Make research goal more specific
  • Review actual evidence snippets
  • Adjust filtering logic
Many empty results

Possible causes:
  • Invalid domains
  • Limited public information
  • Search strategies not finding evidence
Solutions:
  • Validate domain list
  • Increase search depth
  • Try different research goal phrasing
  • Check sample companies manually
Inconsistent confidence scores

Possible causes:
  • Mixed source reliability
  • Broad search strategies
  • Varied company documentation
Solutions:
  • Filter by source type
  • Increase confidence threshold
  • Review high-confidence results only
  • Validate manually

Next Steps

Running Research

Learn execution modes and parameters

Search Strategies

Understand strategy generation and optimization
