
Understanding Results

The GTM Research Engine returns rich, structured results combining evidence from multiple sources with AI-powered analysis. This guide explains how to interpret confidence scores, evidence quality, and performance metrics.

Result Structure

Every research response follows a consistent structure:
{
  "research_id": "550e8400-e29b-41d4-a716-446655440000",
  "total_companies": 10,
  "search_strategies_generated": 8,
  "total_searches_executed": 80,
  "processing_time_ms": 45230,
  "results": [...],
  "search_performance": {
    "queries_per_second": 12.5,
    "failed_requests": 2
  }
}
Top-level fields provide execution context:
  • research_id: Unique identifier for this research run
  • total_companies: Number of domains analyzed
  • search_strategies_generated: Search strategies created by the LLM
  • total_searches_executed: Total searches performed (companies × strategies)
  • processing_time_ms: End-to-end execution time
  • search_performance: Performance metrics
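Assuming the JSON response has been parsed into a Python dict, these fields can be checked programmatically (the values below are taken from the example above):

```python
# Derive execution context from the top-level response fields.
response = {
    "research_id": "550e8400-e29b-41d4-a716-446655440000",
    "total_companies": 10,
    "search_strategies_generated": 8,
    "total_searches_executed": 80,
    "processing_time_ms": 45230,
    "search_performance": {"queries_per_second": 12.5, "failed_requests": 2},
}

# Sanity check: searches = companies x strategies.
assert response["total_searches_executed"] == (
    response["total_companies"] * response["search_strategies_generated"]
)

# Failure rate as a fraction of all executed searches.
failure_rate = (
    response["search_performance"]["failed_requests"]
    / response["total_searches_executed"]
)
print(f"failure rate: {failure_rate:.1%}")  # prints "failure rate: 2.5%"
```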

Company Results

Each company in the results array contains comprehensive findings:
{
  "domain": "stripe.com",
  "confidence_score": 0.87,
  "evidence_sources": 4,
  "findings": {
    "technologies": [
      "Kubernetes",
      "Docker",
      "Google Cloud Platform",
      "Istio"
    ],
    "evidence": [
      {
        "url": "https://stripe.com/blog/operating-kubernetes-at-stripe",
        "title": "Operating Kubernetes at Stripe",
        "snippet": "We use Kubernetes to orchestrate our microservices across multiple data centers...",
        "source_name": "google_search"
      }
    ],
    "signals_found": 12
  }
}

Core Fields

Company identifier: the domain exactly as provided in the request.
{
  "domain": "stripe.com"
}
Used to correlate results with input data.
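Since results echo back the input domains, a simple index joins findings to your input list; a minimal sketch with hypothetical data:

```python
# Index results by domain to join findings back to the input list.
input_domains = ["stripe.com", "square.com"]
results = [
    {"domain": "stripe.com", "confidence_score": 0.87},
    {"domain": "square.com", "confidence_score": 0.0},
]

by_domain = {r["domain"]: r for r in results}
for domain in input_domains:
    score = by_domain[domain]["confidence_score"]
    print(f"{domain}: {score}")
```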

Confidence Scoring

Confidence scores are generated by LLM analysis of all collected evidence:
# From pipeline.py:183-206
def _build_company_result(
    self, domain: str, evidences: List[Evidence], analysis_result: Dict
) -> CompanyResearchResult:
    # Extract technologies, signals, and confidence from LLM analysis
    technologies = analysis_result.get("technologies", [])
    goal_match_signals = analysis_result.get("goal_match_signals", [])
    confidence_score = analysis_result.get("confidence_score", 0.0)
    
    # Calculate evidence sources count
    evidence_sources = len(set(evidence.source_name for evidence in evidences))
    
    # Build findings with LLM results
    findings = Findings(
        technologies=technologies,
        evidence=evidences,
        signals_found=len(goal_match_signals)
    )
    
    return CompanyResearchResult(
        domain=domain,
        confidence_score=round(confidence_score, 2),
        evidence_sources=evidence_sources,
        findings=findings
    )

Scoring Factors

1. Evidence Quantity

More evidence items generally correlate with higher confidence:
  • 1-3 items: Limited basis for assessment
  • 4-8 items: Good evidence foundation
  • 9+ items: Strong evidence base

2. Source Diversity

Evidence from multiple sources increases confidence:
  • Single source: Possible bias or limited scope
  • 2-3 sources: Good cross-validation
  • 4+ sources: Excellent corroboration

3. Technology Matches

Direct mentions of research goal technologies:
  • Exact technology names (“Kubernetes”)
  • Related technologies (“container orchestration”)
  • Implementation details (“production k8s cluster”)

4. Signal Strength

Quality of goal-match signals:
  • Direct statements (“We use Kubernetes…”)
  • Job requirements (“Kubernetes experience required”)
  • Case studies (“Migrating to Kubernetes”)
  • News announcements (“Kubernetes adoption”)

Confidence Labels

The frontend translates scores to human-readable labels:
// From ResearchResults.tsx:188-194
const getConfidenceLabel = (score: number) => {
  if (score >= 0.9) return "Very High";
  if (score >= 0.8) return "High";
  if (score >= 0.7) return "Good";
  if (score >= 0.6) return "Moderate";
  return "Low";
};
{
  "confidence_score": 0.93,
  "interpretation": "Very strong evidence across multiple sources",
  "characteristics": [
    "Multiple direct technology mentions",
    "Evidence from 4+ sources",
    "Recent and specific findings",
    "Strong goal-match signals"
  ]
}

Confidence Thresholds

The confidence_threshold parameter filters high-confidence results:
{
  "confidence_threshold": 0.7,
  "results": [...],  // All results regardless of confidence
  "high_confidence_results": [...]  // Only results >= 0.7
}
Use high_confidence_results for automated decisions, but review all results to avoid missing borderline matches.
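A minimal client-side sketch of the same split, using hypothetical scores (the API already returns high_confidence_results; this is only for illustration):

```python
# Partition results by a confidence threshold and surface borderline cases.
threshold = 0.7
results = [
    {"domain": "a.com", "confidence_score": 0.93},
    {"domain": "b.com", "confidence_score": 0.68},
    {"domain": "c.com", "confidence_score": 0.71},
]

high_confidence = [r for r in results if r["confidence_score"] >= threshold]
# Results just under the threshold are worth a manual look.
borderline = [
    r for r in results
    if threshold - 0.1 <= r["confidence_score"] < threshold
]

print([r["domain"] for r in high_confidence])  # prints ['a.com', 'c.com']
print([r["domain"] for r in borderline])       # prints ['b.com']
```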

Technologies Array

Technologies are identified automatically by the LLM from the collected evidence:
{
  "technologies": [
    "Kubernetes",
    "Docker",
    "Google Cloud Platform",
    "Istio",
    "Terraform",
    "Python"
  ]
}
Extraction Process:
# From pipeline.py:186-189
technologies = analysis_result.get("technologies", [])
goal_match_signals = analysis_result.get("goal_match_signals", [])
confidence_score = analysis_result.get("confidence_score", 0.0)
The LLM analyzes all evidence and identifies:
  • Primary technologies: Directly related to research goal
  • Supporting technologies: Commonly used with primary tech
  • Infrastructure: Platforms, cloud providers, tools
  • Languages: Programming languages mentioned
Technologies are deduplicated and normalized (e.g., “K8s” → “Kubernetes”).
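The normalization itself happens inside the LLM analysis; a rough client-side sketch of the same idea, with a hypothetical alias map:

```python
# Hypothetical alias map; the actual normalization is done by the LLM.
ALIASES = {"k8s": "Kubernetes", "gcp": "Google Cloud Platform"}

def normalize_technologies(raw: list[str]) -> list[str]:
    """Map known aliases to canonical names and deduplicate, preserving order."""
    seen, result = set(), []
    for tech in raw:
        canonical = ALIASES.get(tech.lower(), tech)
        if canonical.lower() not in seen:
            seen.add(canonical.lower())
            result.append(canonical)
    return result

print(normalize_technologies(["K8s", "Kubernetes", "Docker", "GCP"]))
# prints ['Kubernetes', 'Docker', 'Google Cloud Platform']
```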

Technology Relevance

Not all technologies carry equal weight:
High relevance: technologies directly matching the research goal “Companies using Kubernetes”:
  • Kubernetes
  • K8s
  • Container orchestration
  • kubectl
These directly indicate a goal match.

Medium relevance: supporting or related technologies for the same goal:
  • Docker
  • Helm
  • Istio
  • Prometheus
Often used with Kubernetes; these strengthen the case.

Low relevance: tangentially related or common technologies:
  • Python
  • React
  • PostgreSQL
  • AWS
Common technologies that don’t indicate Kubernetes usage.
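These tiers could be applied client-side; a sketch with hypothetical tier sets for the same Kubernetes goal (the engine does not expose this classification):

```python
# Hypothetical tier sets for a "Companies using Kubernetes" goal.
HIGH = {"kubernetes", "k8s", "container orchestration", "kubectl"}
MEDIUM = {"docker", "helm", "istio", "prometheus"}

def relevance(tech: str) -> str:
    """Classify a technology's relevance to the research goal."""
    name = tech.lower()
    if name in HIGH:
        return "high"
    if name in MEDIUM:
        return "medium"
    return "low"

print([relevance(t) for t in ["Kubernetes", "Helm", "Python"]])
# prints ['high', 'medium', 'low']
```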

Evidence Structure

Evidence items are the foundation of all findings:
{
  "url": "https://stripe.com/blog/operating-kubernetes-at-stripe",
  "title": "Operating Kubernetes at Stripe",
  "snippet": "We use Kubernetes to orchestrate our microservices across multiple data centers, managing thousands of containers with automated rollouts and rollbacks.",
  "source_name": "google_search"
}

Evidence Fields

Source URL: direct link to the evidence source. Examples:
  • Blog posts: https://company.com/blog/kubernetes-migration
  • Job postings: https://company.com/careers/devops-engineer
  • News: https://techcrunch.com/2024/company-kubernetes
  • Documentation: https://company.com/docs/infrastructure.pdf
Always verify evidence by clicking through to source URLs.

Evidence Quality

Not all evidence is equally valuable:
{
  "url": "https://company.com/blog/kubernetes-production",
  "title": "Running Kubernetes in Production: Lessons Learned",
  "snippet": "After 2 years running Kubernetes in production with 1000+ pods across 50 microservices, we've learned...",
  "source_name": "google_search",
  "quality_indicators": [
    "Direct first-person statement",
    "Specific quantitative details",
    "Production usage confirmed",
    "Recent and detailed"
  ]
}

Evidence Grouping

The frontend groups evidence by source for easier review:
// From ResearchResults.tsx:196-203
const evidenceBySource = company.findings.evidence.reduce((acc, evidence) => {
  if (!acc[evidence.source_name]) {
    acc[evidence.source_name] = [];
  }
  acc[evidence.source_name].push(evidence);
  return acc;
}, {} as Record<string, typeof company.findings.evidence>);
Results display in tabs:
  • google_search (12 items)
  • jobs_search (5 items)
  • news_search (3 items)
Grouping helps identify which sources provided strongest evidence and detect potential gaps.
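The same grouping is straightforward in scripts; a Python equivalent of the frontend reduce, with hypothetical evidence items:

```python
from collections import defaultdict

# Group evidence items by source_name, mirroring the frontend tabs.
evidence = [
    {"url": "https://stripe.com/blog/k8s", "source_name": "google_search"},
    {"url": "https://stripe.com/careers/devops", "source_name": "jobs_search"},
    {"url": "https://stripe.com/blog/infra", "source_name": "google_search"},
]

by_source = defaultdict(list)
for item in evidence:
    by_source[item["source_name"]].append(item)

print({source: len(items) for source, items in by_source.items()})
# prints {'google_search': 2, 'jobs_search': 1}
```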

Signals Found

Signal count indicates goal-match strength:
{
  "signals_found": 12
}
Calculated from LLM analysis:
# From pipeline.py:198
signals_found=len(goal_match_signals)

Signal Types

Signals are discrete indicators of goal match:
Direct statements: the company explicitly states technology use. Examples:
  • “We use Kubernetes for container orchestration”
  • “Our infrastructure runs on GKE”
  • “Migrated to Kubernetes in 2024”
Weight: highest confidence

Signal Interpretation

  • 0-2 signals: Minimal evidence, likely weak or no match
  • 3-5 signals: Some evidence, possible match but limited
  • 6-9 signals: Good evidence across multiple types
  • 10-15 signals: Strong evidence with diverse signal types
  • 16+ signals: Very strong evidence, high confidence match
Signal count alone doesn’t determine confidence; quality matters more than quantity.
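The bands above can be expressed as a small helper (hypothetical; not part of the API):

```python
def interpret_signals(count: int) -> str:
    """Map a signal count to an interpretation band."""
    if count >= 16:
        return "Very strong evidence, high confidence match"
    if count >= 10:
        return "Strong evidence with diverse signal types"
    if count >= 6:
        return "Good evidence across multiple types"
    if count >= 3:
        return "Some evidence, possible match but limited"
    return "Minimal evidence, likely weak or no match"

print(interpret_signals(12))
# prints "Strong evidence with diverse signal types"
```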

Performance Metrics

Every response includes performance data:
{
  "processing_time_ms": 45230,
  "search_performance": {
    "queries_per_second": 12.5,
    "failed_requests": 2
  }
}

Metrics Breakdown

Total execution time in milliseconds. Includes:
  • Strategy generation (~2-5s)
  • Evidence collection (varies by depth)
  • LLM analysis (~1-3s per company)
Expected ranges:
  • Quick depth: 15,000-30,000ms (15-30s)
  • Standard depth: 30,000-60,000ms (30-60s)
  • Comprehensive: 60,000-120,000ms (1-2min)
Higher max_parallel_searches reduces processing time but may increase errors.
Search execution throughput. Calculated from:
# From response.py (SearchPerformance)
queries_per_second = total_searches / processing_time_seconds
Typical values:
  • 8-12 QPS: Good performance
  • 13-20 QPS: Excellent performance
  • Less than 8 QPS: Possible bottlenecks
Factors affecting QPS:
  • max_parallel_searches setting
  • Source API response times
  • Network latency
  • Rate limiting
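The formula with illustrative numbers (not the example response above):

```python
# Reproduce the queries_per_second formula from millisecond timing.
total_searches = 100
processing_time_ms = 8000

qps = total_searches / (processing_time_ms / 1000)
print(qps)  # prints 12.5
```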
Count of failed searches. Incremented when:
  • Source API errors
  • Network timeouts
  • Circuit breaker trips
  • Invalid responses
Acceptable levels:
  • 0-2%: Normal (transient errors)
  • 3-5%: Monitor (possible issues)
  • >5%: Investigate (systematic problems)
Example:
{
  "total_searches_executed": 80,
  "failed_requests": 2,
  "failure_rate": 0.025  // 2.5%
}
High failure rates indicate configuration issues or source API problems.
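A small helper applying the acceptable levels above (hypothetical; not part of the API):

```python
def failure_rate_status(failed: int, total: int) -> str:
    """Classify a failure rate against the acceptable levels."""
    rate = failed / total
    if rate > 0.05:
        return "investigate"
    if rate >= 0.03:
        return "monitor"
    return "normal"

print(failure_rate_status(2, 80))  # 2.5%, prints "normal"
```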

Empty Results

Companies without evidence receive empty results:
# From pipeline.py:208-221
def _build_empty_result(self, domain: str) -> CompanyResearchResult:
    findings = Findings(
        technologies=[],
        evidence=[],
        signals_found=0
    )
    
    return CompanyResearchResult(
        domain=domain,
        confidence_score=0.0,
        evidence_sources=0,
        findings=findings
    )
Result structure:
{
  "domain": "example.com",
  "confidence_score": 0.0,
  "evidence_sources": 0,
  "findings": {
    "technologies": [],
    "evidence": [],
    "signals_found": 0
  }
}
Common reasons:

1. No Public Information

Company has limited online presence or documentation.

2. Incorrect Domain

Domain doesn’t match actual company website.

3. Technology Not Used

Company genuinely doesn’t use the researched technology.

4. Search Strategies Missed Evidence

Evidence exists but strategies didn’t find it (try comprehensive depth).
Empty results are still valuable: they indicate companies that likely don’t match your criteria.

Frontend Result Display

The web interface presents results with visual hierarchy:
// From ResearchResults.tsx:212-252
<Card elevation={2}>
  {/* Company Header */}
  <Typography variant="h6">{company.domain}</Typography>
  
  {/* Confidence Chips */}
  <Chip label={`${(company.confidence_score * 100).toFixed(0)}% Confidence`} />
  <Chip label={getConfidenceLabel(company.confidence_score)} />
  <Chip label={`${company.evidence_sources} Sources`} />
  
  {/* Technologies */}
  {company.findings.technologies.map((tech) => (
    <Chip label={tech} variant="outlined" />
  ))}
  
  {/* Evidence Tabs */}
  <Tabs>
    {sourceNames.map((sourceName) => (
      <Tab label={sourceName} />
    ))}
  </Tabs>
  
  {/* Summary */}
  <Typography>
    {company.findings.signals_found} signals detected
    across {company.evidence_sources} data sources
  </Typography>
</Card>

Visual Indicators

// From ResearchResults.tsx:182-186
const getConfidenceColor = (score: number) => {
  if (score >= 0.8) return "success";  // Green
  if (score >= 0.6) return "warning";  // Orange
  return "error";  // Red
};
Visual feedback:
  • 🟢 Green: High confidence (≥0.8)
  • 🟠 Orange: Moderate (0.6-0.8)
  • 🔴 Red: Low (less than 0.6)

Streaming Results

Streaming mode provides incremental results:

Event Types

The stream begins with a pipeline_start event announcing the run:
{
  "type": "pipeline_start",
  "message": "Starting evidence collection",
  "domains": ["stripe.com", "square.com"],
  "total_strategies": 8,
  "timestamp": 1709856234.567
}
Process domain_analyzed events to display results incrementally as they complete, improving perceived performance.
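A minimal sketch of consuming the stream; the domain_analyzed payload shape shown here is an assumption:

```python
# Dispatch streamed events as they arrive; results accumulate incrementally.
def handle_event(event: dict, results: list) -> None:
    if event["type"] == "pipeline_start":
        print(f"Researching {len(event['domains'])} domains...")
    elif event["type"] == "domain_analyzed":
        # Append each company result as soon as it completes.
        results.append(event["result"])

results: list = []
handle_event(
    {"type": "pipeline_start", "domains": ["stripe.com", "square.com"]}, results
)
handle_event(
    {"type": "domain_analyzed",
     "result": {"domain": "stripe.com", "confidence_score": 0.87}},
    results,
)
print(len(results))  # prints 1
```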

Best Practices

Use case-specific thresholds:
  • Automated filtering: 0.8-0.9 (high precision)
  • Human review: 0.6-0.7 (balanced recall/precision)
  • Exploratory research: 0.5-0.6 (high recall)
  • All results: 0.0 (no filtering)
Always review borderline cases manually.
Click through to source URLs:
  • Verify evidence is current
  • Check context around snippets
  • Assess source reliability
  • Look for implementation details
Don’t rely solely on confidence scores.
Higher evidence_sources = stronger findings:
  • Single source: Verify independently
  • 2-3 sources: Good corroboration
  • 4+ sources: Strong validation
Multiple sources reduce false positives.
Not all technologies are equally relevant:
  • Focus on research goal matches
  • Consider supporting technologies
  • Ignore common/generic tech
Review actual evidence for technology context.
Track metrics over time:
  • Processing time trends
  • Failure rate patterns
  • QPS variations
Identify and resolve performance issues early.

Troubleshooting

Few or no matches found

Possible causes:
  • Research goal too specific
  • Technologies not widely publicized
  • Wrong company domains
Solutions:
  • Broaden research criteria
  • Try comprehensive search depth
  • Verify domain accuracy
  • Review evidence manually
Irrelevant or incorrect matches

Possible causes:
  • Research goal ambiguity
  • LLM misinterpreting goal
  • Related but different technologies
Solutions:
  • Make research goal more specific
  • Review actual evidence snippets
  • Adjust filtering logic
Many empty results

Possible causes:
  • Invalid domains
  • Limited public information
  • Search strategies not finding evidence
Solutions:
  • Validate domain list
  • Increase search depth
  • Try different research goal phrasing
  • Check sample companies manually
Inconsistent confidence scores

Possible causes:
  • Mixed source reliability
  • Broad search strategies
  • Varied company documentation
Solutions:
  • Filter by source type
  • Increase confidence threshold
  • Review high-confidence results only
  • Validate manually

Next Steps

Running Research

Learn execution modes and parameters

Search Strategies

Understand strategy generation and optimization
