Running Research

The GTM Research Engine provides multiple ways to execute research queries. This guide covers both batch and streaming execution modes, along with configuration options and best practices.

Quick Start

1. Configure Your Research Goal

Define a clear, specific research objective that describes what you’re looking for:
{
  "research_goal": "Find SaaS companies using machine learning for fraud detection"
}
2. Specify Company Domains

Provide a list of company domains to analyze:
{
  "company_domains": ["stripe.com", "plaid.com", "affirm.com"]
}
3. Select Search Depth

Choose the appropriate search depth based on your needs:
  • quick: 4-6 strategies, fastest execution
  • standard: 7-10 strategies, balanced coverage
  • comprehensive: 11-13 strategies, maximum evidence
4. Execute the Research

Submit your request to the appropriate endpoint based on your needs.

Execution Modes

The research engine supports two execution modes: batch and streaming.

Batch Mode

Batch mode executes all research and returns complete results in a single response. Best for small datasets or when you need all results at once.
curl -X POST http://localhost:8000/research/batch \
  -H "Content-Type: application/json" \
  -d '{
    "research_goal": "Companies using Kubernetes in production",
    "company_domains": ["netflix.com", "airbnb.com", "uber.com"],
    "search_depth": "standard",
    "max_parallel_searches": 10,
    "confidence_threshold": 0.7
  }'
Batch mode is ideal for:
  • Small to medium datasets (< 50 companies)
  • Automated workflows requiring complete results
  • Scenarios where latency is less critical
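The same batch request can be issued from Python. This is a sketch assuming the local endpoint shown in the curl example above; the helper name is illustrative:

```python
import requests

def run_batch_research(payload, base_url="http://localhost:8000"):
    """Submit a batch research request and return the parsed results."""
    # Batch mode blocks until the full result set is ready, so allow a generous timeout
    response = requests.post(f"{base_url}/research/batch", json=payload, timeout=300)
    response.raise_for_status()
    return response.json()

payload = {
    "research_goal": "Companies using Kubernetes in production",
    "company_domains": ["netflix.com", "airbnb.com", "uber.com"],
    "search_depth": "standard",
    "max_parallel_searches": 10,
    "confidence_threshold": 0.7,
}
```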

Streaming Mode

Streaming mode provides real-time progress updates via Server-Sent Events (SSE). Best for large datasets or when you need live progress feedback.
import requests
import json

url = "http://localhost:8000/research/batch/stream"
payload = {
    "research_goal": "SaaS companies hiring ML engineers",
    "company_domains": ["openai.com", "anthropic.com", "cohere.ai"],
    "search_depth": "comprehensive",
    "max_parallel_searches": 15,
    "confidence_threshold": 0.75
}

with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if line:
            # Remove 'data: ' prefix from SSE format
            data = line.decode('utf-8')
            if data.startswith('data: '):
                data = data[6:]
            
            event = json.loads(data)
            
            if event['type'] == 'pipeline_start':
                print(f"Starting research on {len(event['domains'])} domains")
            elif event['type'] == 'evidence_progress':
                print(f"Evidence collection: {event['progress']}% complete")
            elif event['type'] == 'domain_analyzed':
                print(f"Analyzed {event['domain']}: {event['confidence']} confidence")
            elif event['type'] == 'pipeline_complete':
                print("Research complete!")
                results = event['results']
                break
Streaming mode provides real-time events:
  • pipeline_start: Research begins
  • evidence_progress: Evidence collection updates (25%, 50%, 75%, 100%)
  • evidence_complete: All evidence collected
  • analysis_start: LLM analysis begins
  • domain_analyzed: Individual domain completion
  • pipeline_complete: Final results with full summary

Request Parameters

All research requests require the following parameters:
research_goal
Type: string
The high-level research objective describing what you’re looking for.
Examples:
  • “Find fintech companies using AI for fraud detection”
  • “SaaS companies with open engineering positions”
  • “Healthcare companies implementing blockchain technology”
Best Practices:
  • Be specific about the technology or criteria
  • Include industry context when relevant
  • Use natural language descriptions
company_domains
Type: array<string>
List of company domains to analyze. Domains should be in the format example.com (no protocol).
Example:
["stripe.com", "square.com", "adyen.com"]
Limits:
  • Recommended: 1-100 domains per request
  • For larger datasets, use streaming mode
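Since domains must be bare hostnames (no protocol or path), a small normalization helper can catch malformed input before submission. This helper is hypothetical, not part of the engine:

```python
from urllib.parse import urlparse

def normalize_domain(raw: str) -> str:
    """Strip scheme, path, and whitespace so 'https://Stripe.com/about' -> 'stripe.com'."""
    raw = raw.strip()
    # urlparse only splits out the hostname when a scheme is present
    if "://" not in raw:
        raw = "https://" + raw
    host = urlparse(raw).hostname or ""
    # Drop a leading www. for consistency
    if host.startswith("www."):
        host = host[4:]
    return host.lower()
```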
search_depth
Type: "quick" | "standard" | "comprehensive"
Controls the number of search strategies generated and executed:
  • quick: 4-6 strategies per domain
    • Fastest execution
    • Focuses on high-yield sources
    • Best for preliminary research
  • standard: 7-10 strategies per domain
    • Balanced speed and coverage
    • Diverse search types
    • Recommended for most use cases
  • comprehensive: 11-13 strategies per domain
    • Maximum evidence gathering
    • Exhaustive coverage
    • Best for critical decisions
Execution Time Comparison (10 companies):
  • Quick: ~15-30 seconds
  • Standard: ~30-60 seconds
  • Comprehensive: ~60-120 seconds
max_parallel_searches
Type: integer
Maximum number of concurrent search requests per source.
Recommended Values:
  • Quick searches: 5-10
  • Standard searches: 10-15
  • Comprehensive searches: 15-20
Considerations:
  • Higher values = faster execution but more API load
  • Rate limits apply per data source
  • Circuit breakers prevent overload
Setting this value too high may trigger rate limits or circuit breakers. Start with 10 and increase if needed.
confidence_threshold
Type: float (0.0 - 1.0)
Minimum confidence score for results. Only companies meeting or exceeding this threshold are included in high-confidence results.
Recommended Thresholds:
  • 0.9-1.0: Very high confidence only (strict filtering)
  • 0.7-0.9: High confidence (balanced approach)
  • 0.5-0.7: Medium confidence (broader results)
  • 0.0-0.5: Include all findings
Example:
{
  "confidence_threshold": 0.75
}
All results are returned regardless of confidence, but results are also filtered into a high_confidence_results array based on this threshold.
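Client-side, the same threshold split can be reproduced. This sketch assumes each result dict carries a confidence field, as in the response examples above:

```python
def split_by_confidence(results, threshold=0.75):
    """Partition results into (high_confidence, rest) using the threshold."""
    high = [r for r in results if r.get("confidence", 0.0) >= threshold]
    rest = [r for r in results if r.get("confidence", 0.0) < threshold]
    return high, rest
```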

Understanding the Pipeline

The research pipeline executes in two main phases: evidence collection, then LLM analysis. In the first phase, the engine collects evidence from multiple sources in parallel:
# From pipeline.py:136-164
# Creates tasks for all domain/strategy combinations
tasks = []
for domain in company_domains:
    for strategy in strategies:
        tasks.append(
            asyncio.create_task(
                self._execute_one(domain, strategy, search_depth)
            )
        )

# Executes all searches in parallel with rate limiting
for coro in asyncio.as_completed(tasks):
    domain, result = await coro
    if result.ok and result.evidences:
        domain_to_evidence[domain].extend(result.evidences)
Key Features:
  • Parallel execution across all sources
  • Per-source rate limiting via semaphores
  • Circuit breakers prevent cascade failures
  • Automatic retry on transient errors
Progress Tracking (Streaming):
  • Updates at 25%, 50%, 75%, 100% completion
  • Reports domains with evidence found
  • Tracks total evidence collected
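The milestone reporting described above can be sketched as a small helper. This is hypothetical; the engine's internal implementation may differ:

```python
def crossed_milestones(completed: int, total: int, already_emitted: set) -> list:
    """Return milestone percentages (25/50/75/100) newly crossed, updating the set."""
    pct = completed * 100 // total
    newly = [m for m in (25, 50, 75, 100) if pct >= m and m not in already_emitted]
    already_emitted.update(newly)
    return newly
```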

Frontend Usage

The web interface provides a visual way to execute research:
// From App.tsx:46-55
const onSubmit = () => {
  const settings = getSettings();
  handleSubmit({
    research_goal: searchQuery,
    company_domains: settings.companyDomains,
    search_depth: settings.searchDepth,
    max_parallel_searches: settings.maxParallelSearches,
    confidence_threshold: settings.confidenceThreshold,
  });
};

Configuration Steps

1. Enter Research Goal

Type your research objective in the main search field.
2. Configure Settings

Click the settings icon to configure:
  • Company domains (one per line)
  • Search depth (quick/standard/comprehensive)
  • Max parallel searches (5-20)
  • Confidence threshold (0.0-1.0)
3. Execute Search

Click “Start Research” to begin execution.
4. View Results

Results display with:
  • Confidence scores and labels
  • Technologies found
  • Evidence grouped by source
  • Performance metrics

Performance Optimization

Rate Limiting

The engine uses semaphores for per-source rate limiting:
# From pipeline.py:44-49
self.source_pools: Dict[str, asyncio.Semaphore] = {
    "google_search": asyncio.Semaphore(max_parallel_searches),   
    "jobs_search": asyncio.Semaphore(max_parallel_searches),
    "news_search": asyncio.Semaphore(max_parallel_searches),
}
Each source has independent rate limiting, allowing maximum parallelization across different data sources.
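Acquiring a slot from the per-source pool during a search might look like the following sketch. The semaphore usage mirrors the source_pools snippet above; the surrounding function is hypothetical:

```python
import asyncio

async def rate_limited_search(pool: asyncio.Semaphore, do_search):
    """Run one search while holding a slot from the source's semaphore pool."""
    # Waits if the source already has max_parallel_searches requests in flight
    async with pool:
        return await do_search()
```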

Circuit Breakers

Circuit breakers prevent cascade failures:
# From pipeline.py:59-65
self.breakers: Dict[str, CircuitBreaker] = {
    channel: CircuitBreaker(
        failure_threshold=settings.circuit_breaker_failures,
        reset_timeout_seconds=settings.circuit_breaker_reset_seconds,
    )
    for channel in self.sources.keys()
}
Behavior:
  • Opens after N consecutive failures
  • Blocks requests while open
  • Automatically resets after timeout
  • Prevents overloading failing services
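A minimal breaker implementing that behavior might look like this sketch; the engine's actual CircuitBreaker may differ in detail:

```python
import time

class SimpleCircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; resets after a timeout."""

    def __init__(self, failure_threshold=5, reset_timeout_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout_seconds
        self.failures = 0
        self.opened_at = None

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            # Half-open: let one request through to probe the source
            self.failures = 0
            self.opened_at = None
            return True
        return False
```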

Metrics Tracking

The engine tracks performance metrics:
# From pipeline.py:104-108
if result.ok and result.evidences:
    self.metrics.record_query(1)    
    breaker.record_success()
else:
    self.metrics.record_failure(1)
    breaker.record_failure()
Collected Metrics:
  • Total queries executed
  • Failed requests count
  • Queries per second
  • Processing time
  • Circuit breaker states
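A minimal accumulator along these lines is sketched below; the engine's tracker likely records more, and the class name is illustrative:

```python
class SearchMetrics:
    """Tracks query counts and derives queries-per-second."""

    def __init__(self):
        self.queries = 0
        self.failures = 0

    def record_query(self, n=1):
        self.queries += n

    def record_failure(self, n=1):
        self.failures += n

    def queries_per_second(self, elapsed_seconds: float) -> float:
        # Guard against division by zero for instantaneous reads
        if elapsed_seconds <= 0:
            return 0.0
        return self.queries / elapsed_seconds
```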

Best Practices

Choose the Right Search Depth

  • Quick: Use for preliminary research or testing
  • Standard: Default for most production use cases
  • Comprehensive: Critical decisions requiring maximum evidence

Tune Parallelism

Start with max_parallel_searches: 10 and adjust based on:
  • Response times
  • Error rates
  • Data source rate limits
Monitor circuit breaker trips as indicators of excessive concurrency.

Set the Confidence Threshold

  • Start with 0.7 for balanced results
  • Increase to 0.8-0.9 for high-precision use cases
  • Decrease to 0.5-0.6 for exploratory research
Review the high_confidence_results array for filtered results.

Handle Large Datasets

For 20+ companies:
  • Use streaming mode for progress visibility
  • Implement proper SSE handling
  • Process results incrementally
  • Monitor heartbeats to detect disconnections
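Heartbeat monitoring can be as simple as tracking the time since the last event. This is a sketch; the actual heartbeat event name and cadence depend on the server:

```python
import time

class HeartbeatMonitor:
    """Flags a streaming connection as stale when no event arrives within the timeout."""

    def __init__(self, timeout_seconds=30.0):
        self.timeout = timeout_seconds
        self.last_event = time.monotonic()

    def beat(self):
        """Call on every SSE event, including heartbeats."""
        self.last_event = time.monotonic()

    def is_stale(self) -> bool:
        return time.monotonic() - self.last_event > self.timeout
```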

Handle Errors Gracefully

Common error scenarios:
  • Circuit breaker open (wait for reset)
  • Rate limit exceeded (reduce max_parallel_searches)
  • Invalid domains (validate before submission)
  • Network timeouts (implement retry logic)
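For transient failures such as network timeouts, retry logic with exponential backoff can be sketched as follows; the helper name and retriable exception set are illustrative:

```python
import time

def with_retries(fn, max_attempts=3, base_delay=0.5, retriable=(TimeoutError,)):
    """Call fn(), retrying on retriable errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_attempts:
                raise
            # Back off: base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```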

Monitoring and Debugging

Response Metrics

Every response includes performance metrics:
{
  "processing_time_ms": 45230,
  "search_performance": {
    "queries_per_second": 12.5,
    "failed_requests": 2
  }
}

Logging

The pipeline logs key events:
# From routes.py:35-36
end_time = time.time()
print(f"Query generation time: {end_time - start_time} seconds")
Monitor logs for:
  • Query generation time (should be < 5s)
  • Pipeline execution time
  • Circuit breaker events
  • Rate limit warnings

Troubleshooting

No Results Returned

Possible Causes:
  • Invalid company domains
  • Overly restrictive research goal
  • All sources failed (check circuit breakers)
Solutions:
  • Verify domain format (no https://)
  • Broaden research criteria
  • Check source availability
  • Review error logs

Slow Execution

Possible Causes:
  • Too many domains
  • Low max_parallel_searches value
  • Network latency
  • Source API slowness
Solutions:
  • Batch large domain lists
  • Increase parallel search limit
  • Use quick search depth
  • Check source performance metrics

High Failure Rates

Possible Causes:
  • Excessive concurrency
  • Source API issues
  • Invalid search strategies
Solutions:
  • Reduce max_parallel_searches
  • Check circuit breaker status
  • Review generated strategies
  • Verify API credentials

Next Steps

Search Strategies

Learn how strategies are generated and optimized

Understanding Results

Interpret confidence scores and evidence data
