
Search Strategies

Search strategies are the core of the GTM Research Engine’s intelligence. The system uses LLM-powered strategy generation to create targeted searches across multiple data sources, maximizing evidence collection while respecting rate limits.

Overview

The QueryGenerator service converts research goals into 4-13 optimized search strategies, distributed across multiple channels based on search depth.
# From query_generation.py:17-21
class QueryGenerator:
    """LLM-powered query strategy generator using Google Gemini.

    Generates 8-12 diverse strategies across channels depending on search depth.
    """

Strategy Generation Process

Step 1: Receive Research Goal

The system receives a natural language research goal:
{
  "research_goal": "Find SaaS companies using Kubernetes in production",
  "search_depth": "standard"
}

Step 2: LLM Strategy Generation

Google Gemini 2.5 Flash analyzes the goal and generates optimized strategies:
# From query_generation.py:162-174
async def generate_strategies(
    self, research_goal: str, search_depth: str
) -> List[QueryStrategy]:
    try:
        strategies = await self._generate_with_llm(
            research_goal, search_depth
        )
        return strategies
    except Exception as e:
        print(f"LLM generation failed: {e}")
        return []

Step 3: Strategy Validation

Each strategy is validated for:
  • Required fields (channel, query_template)
  • Supported channels
  • Proper placeholder usage
  • Relevance score (0.0-1.0)

Step 4: Relevance Sorting

Strategies are sorted by relevance score for optimized execution:
# From query_generation.py:217-218
strategies.sort(key=lambda s: s.relevance_score, reverse=True)
return strategies
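The sort can be reproduced with a minimal stand-in for the strategy model (the dataclass below is a sketch for illustration; only the channel, query_template, and relevance_score fields are taken from this page):

```python
from dataclasses import dataclass

@dataclass
class QueryStrategy:
    # Minimal stand-in for the engine's strategy model
    channel: str
    query_template: str
    relevance_score: float = 1.0

strategies = [
    QueryStrategy("news_search", "{COMPANY_NAME} kubernetes migration", 0.85),
    QueryStrategy("google_search", "site:{DOMAIN}/blog kubernetes production", 0.95),
    QueryStrategy("google_search", "site:{DOMAIN} microservices", 0.42),
]

# Highest-relevance strategies execute first
strategies.sort(key=lambda s: s.relevance_score, reverse=True)
print([s.relevance_score for s in strategies])  # → [0.95, 0.85, 0.42]
```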

Supported Channels

The engine supports five distinct data source channels:

Search Depth Levels

Search depth controls the number and diversity of strategies generated:
{
  "search_depth": "quick",
  "expected_strategies": "4-6",
  "distribution": {
    "google_search": 2,
    "news_search": 1,
    "jobs_search": 1
  },
  "execution_time": "15-30s per 10 companies",
  "use_case": "Preliminary research, testing, rapid screening"
}

Query Templates & Placeholders

Strategies use placeholders that are dynamically substituted for each company:

Available Placeholders

{DOMAIN}: replaced with the full company domain (e.g., stripe.com). Usage:
# Template
"site:{DOMAIN} kubernetes production"

# Result for stripe.com
"site:stripe.com kubernetes production"
Common Patterns:
  • site:{DOMAIN} - Search within domain
  • site:{DOMAIN}/blog - Search company blog
  • site:{DOMAIN}/careers - Search job postings
  • site:{DOMAIN} filetype:pdf - Find PDFs
  • -site:{DOMAIN} - Exclude domain from results

{COMPANY_NAME}: replaced with the first label of the company domain (e.g., stripe for stripe.com).

Placeholder Substitution

# From search.py:14-24
def build_query(self, company_domain: str) -> str:
    # jobs_search doesn't use domain/company placeholders
    if self.channel == "jobs_search":
        return self.query_template
        
    # Other channels use domain/company substitution
    company_name = company_domain.split(".")[0]
    
    return self.query_template.format(
        DOMAIN=company_domain, COMPANY_NAME=company_name
    )
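A self-contained sketch of the substitution above (the dataclass is a stand-in for the engine's real strategy model, reduced to the two fields build_query needs):

```python
from dataclasses import dataclass

@dataclass
class QueryStrategy:
    channel: str
    query_template: str

    def build_query(self, company_domain: str) -> str:
        # jobs_search templates carry no placeholders to fill
        if self.channel == "jobs_search":
            return self.query_template
        # First label of the domain serves as the company name
        company_name = company_domain.split(".")[0]
        return self.query_template.format(
            DOMAIN=company_domain, COMPANY_NAME=company_name
        )

blog = QueryStrategy("google_search", "site:{DOMAIN}/blog kubernetes production")
jobs = QueryStrategy("jobs_search", "kubernetes devops engineer")

print(blog.build_query("stripe.com"))  # → site:stripe.com/blog kubernetes production
print(jobs.build_query("stripe.com"))  # → kubernetes devops engineer
```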

Relevance Scoring

Each strategy includes a relevance score (0.0-1.0) indicating expected evidence quality:
High relevance (0.90-1.00): direct, specific searches most likely to find strong evidence. Examples:
  • site:{DOMAIN}/blog kubernetes production (0.95)
  • site:{DOMAIN}/careers kubernetes engineer (0.93)
  • site:{DOMAIN} filetype:pdf kubernetes architecture (0.91)
These strategies target exact locations where evidence is likely to exist.
Strong relevance (0.80-0.89): targeted searches with a good probability of relevant results. Examples:
  • {COMPANY_NAME} AND kubernetes migration news (0.85)
  • Jobs search: "kubernetes devops engineer" (0.88)
  • site:linkedin.com/company/{COMPANY_NAME} kubernetes (0.80)
These strategies have a clear connection to the research goal.
Moderate relevance (0.60-0.69): broader searches that may find supporting evidence. Examples:
  • "{COMPANY_NAME}" case study kubernetes -site:{DOMAIN} (0.65)
  • site:{DOMAIN} container orchestration (0.60)
  • News search: {COMPANY_NAME} cloud infrastructure (0.63)
These strategies cast a wider net for corroborating evidence.
Low relevance (0.40-0.49): exploratory searches for edge cases or indirect signals. Examples:
  • "{COMPANY_NAME}" conference presentation cloud (0.45)
  • site:{DOMAIN} microservices (0.42)
  • External: "{COMPANY_NAME}" technology stack (0.40)
These strategies explore tangential evidence.

Scoring Factors

The LLM considers multiple factors when assigning relevance scores:
# From query_generation.py:115-120 (system instruction)
"""Score each strategy based on:
1. Specificity: More specific queries score higher than broad ones
2. Evidence Quality: Strategies likely to find direct evidence score higher
3. Source Reliability: Company websites and job postings often score higher than general web searches
4. Keyword Relevance: Strategies using exact research goal keywords score higher"""
Strategies are executed in descending relevance order, prioritizing high-value searches first.

Strategy Validation

All generated strategies undergo validation before execution:
# From query_generation.py:222-240
def _is_valid_strategy(self, strategy_data: dict, required_fields: set, supported_channels: set) -> bool:
    # Fast set membership checks
    if not required_fields.issubset(strategy_data.keys()):
        return False
    
    if strategy_data["channel"] not in supported_channels:
        return False
    
    # jobs_search doesn't need domain/company placeholders
    channel = strategy_data["channel"]
    template = strategy_data["query_template"]
    
    if channel == "jobs_search":
        return True  # Any template is valid for jobs_search
    else:
        # Other channels require domain/company placeholders
        return "{DOMAIN}" in template or "{COMPANY_NAME}" in template
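The same checks can be exercised with a free-function version (a sketch mirroring the method above, with the required fields and channels from this page hard-coded):

```python
REQUIRED_FIELDS = {"channel", "query_template"}
SUPPORTED_CHANNELS = {"google_search", "news_search", "jobs_search"}

def is_valid_strategy(strategy_data: dict) -> bool:
    # All required fields must be present
    if not REQUIRED_FIELDS.issubset(strategy_data.keys()):
        return False
    # Channel must be one the engine supports
    if strategy_data["channel"] not in SUPPORTED_CHANNELS:
        return False
    # jobs_search templates need no placeholders
    if strategy_data["channel"] == "jobs_search":
        return True
    # Other channels must reference the company somehow
    template = strategy_data["query_template"]
    return "{DOMAIN}" in template or "{COMPANY_NAME}" in template

print(is_valid_strategy(
    {"channel": "google_search", "query_template": "site:{DOMAIN} kubernetes"}
))  # → True
print(is_valid_strategy(
    {"channel": "google_search", "query_template": "kubernetes production"}
))  # → False (no placeholder)
print(is_valid_strategy(
    {"channel": "web_scrape", "query_template": "site:{DOMAIN}"}
))  # → False (unsupported channel)
```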
Validation Checks:
1. Required Fields

  • channel: Must be present
  • query_template: Must be present
  • relevance_score: Optional (defaults to 1.0)

2. Supported Channels

Channel must be one of:
  • google_search
  • news_search
  • jobs_search

3. Placeholder Requirements

  • Jobs search: No placeholders required
  • Other channels: Must include {DOMAIN} OR {COMPANY_NAME}
Invalid strategies are silently filtered out. Monitor generated strategy counts to detect validation issues.

Example Strategy Sets

Here are real-world examples of generated strategy sets:

Example 1: Kubernetes in Production

{
  "research_goal": "Find companies using Kubernetes in production",
  "search_depth": "standard"
}
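A standard-depth run for this goal might yield a set like the following. This output is illustrative, not captured from a live run; the templates are drawn from the scoring examples earlier on this page:

```json
{
  "search_strategies_generated": 8,
  "strategies": [
    {"channel": "google_search", "query_template": "site:{DOMAIN}/blog kubernetes production", "relevance_score": 0.95},
    {"channel": "google_search", "query_template": "site:{DOMAIN}/careers kubernetes engineer", "relevance_score": 0.93},
    {"channel": "jobs_search", "query_template": "kubernetes devops engineer", "relevance_score": 0.88},
    {"channel": "news_search", "query_template": "{COMPANY_NAME} kubernetes migration", "relevance_score": 0.85},
    {"channel": "google_search", "query_template": "site:{DOMAIN} filetype:pdf kubernetes architecture", "relevance_score": 0.82},
    {"channel": "news_search", "query_template": "{COMPANY_NAME} cloud infrastructure", "relevance_score": 0.63},
    {"channel": "google_search", "query_template": "site:{DOMAIN} container orchestration", "relevance_score": 0.60},
    {"channel": "google_search", "query_template": "\"{COMPANY_NAME}\" case study kubernetes -site:{DOMAIN}", "relevance_score": 0.58}
  ]
}
```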

Example 2: AI for Fraud Detection

{
  "research_goal": "Find fintech companies using AI for fraud detection",
  "search_depth": "comprehensive"
}

Strategy Execution

Once generated, strategies are executed in parallel across all companies:
# From pipeline.py:145-152
tasks: List[asyncio.Task[Tuple[str, SourceResult]]] = []
for domain in self.company_domains:
    for strategy in self.strategies:
        tasks.append(
            asyncio.create_task(
                self._execute_one(domain, strategy, self.search_depth)
            )
        )
Execution Flow:
1. Task Creation

Create async task for each domain × strategy combination:
  • 10 companies × 8 strategies = 80 parallel tasks

2. Rate Limiting

Tasks acquire semaphore before executing:
async with source_pool:
    result = await source.fetch(domain, query, search_depth)

3. Circuit Breaker Check

Verify circuit breaker allows request:
if not breaker.allow_request():
    return SourceResult(ok=False, error="circuit open")

4. Evidence Collection

Execute search and collect evidence:
  • Success: Record metrics, update breaker
  • Failure: Log error, increment failure count
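The fan-out pattern can be sketched end to end with a stubbed fetch. Here the semaphore stands in for the source pool, and execute_one replaces the engine's _execute_one with a mock; MAX_PARALLEL is an assumed setting, not a documented default:

```python
import asyncio

MAX_PARALLEL = 5  # stand-in for max_parallel_searches
domains = [f"company{i}.com" for i in range(10)]
strategies = ["site:{DOMAIN} kubernetes", "{COMPANY_NAME} kubernetes news"]

async def execute_one(pool, domain, strategy):
    async with pool:               # rate limit: at most MAX_PARALLEL in flight
        await asyncio.sleep(0.01)  # stub for the real source fetch
        return (domain, strategy, True)

async def run_all():
    pool = asyncio.Semaphore(MAX_PARALLEL)
    # One task per domain × strategy combination
    tasks = [
        asyncio.create_task(execute_one(pool, d, s))
        for d in domains
        for s in strategies
    ]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all())
print(len(results))  # → 20 (10 domains × 2 strategies)
```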

Customizing Strategies

While strategies are auto-generated, you can influence them through:

Research Goal Specificity

{
  "research_goal": "Companies using cloud technologies"
}
Result: Generic strategies across many cloud services
  • Less targeted searches
  • More diverse results
  • Lower precision
More specific research goals produce more targeted strategies with higher relevance scores.

Search Depth Selection

# Quick: Focus on highest-yield sources
{
  "search_depth": "quick",
  # Generates: Company website, jobs, news (4-6 strategies)
}

# Standard: Balanced coverage
{
  "search_depth": "standard",
  # Generates: All sources, moderate depth (7-10 strategies)
}

# Comprehensive: Maximum evidence
{
  "search_depth": "comprehensive",
  # Generates: All sources, maximum depth (11-13 strategies)
}

Performance Considerations

Strategy Count vs. Execution Time

Search Depth    Strategies   10 Companies   50 Companies   100 Companies
Quick           4-6          15-30s         45-90s         90-180s
Standard        7-10         30-60s         90-180s        180-360s
Comprehensive   11-13        60-120s        180-360s       360-720s
Actual times vary based on:
  • max_parallel_searches setting
  • Network latency
  • Source API response times
  • Rate limiting
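A back-of-the-envelope estimate consistent with the table above. Both max_parallel and avg_latency_s are assumed values for illustration, not measured defaults:

```python
def estimate_runtime_s(companies: int, strategies: int,
                       max_parallel: int = 10, avg_latency_s: float = 2.0) -> float:
    # Total searches fan out as companies × strategies,
    # executed in waves of max_parallel concurrent requests.
    total_searches = companies * strategies
    waves = -(-total_searches // max_parallel)  # ceiling division
    return waves * avg_latency_s

print(estimate_runtime_s(10, 8))    # → 16.0  (80 searches in 8 waves)
print(estimate_runtime_s(100, 12))  # → 240.0 (1200 searches in 120 waves)
```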

LLM Generation Time

Strategy generation typically takes 2-5 seconds:
# From routes.py:28-35
start_time = time.time()
query_generator = QueryGenerator()
strategies = await query_generator.generate_strategies(
    research_goal=payload.research_goal,
    search_depth=payload.search_depth,
)
end_time = time.time()
print(f"Query generation time: {end_time - start_time} seconds")
Strategy generation happens once per research request, regardless of company count.

Best Practices

Write specific research goals.
Good:
  • “Find SaaS companies using React and Node.js”
  • “Healthcare companies implementing FHIR standards”
  • “Fintech startups with Series A funding using machine learning”
Avoid:
  • “Find good companies” (too vague)
  • “Companies” (no criteria)
  • “Tech startups in SF” (missing technology focus)

Choose search depth by use case:
  • Testing/Development: Use quick
  • Production Research: Use standard
  • Due Diligence: Use comprehensive
  • Large Batches: Consider quick or standard

Monitor strategy generation: check the search_strategies_generated field in responses:
{
  "search_strategies_generated": 8,
  "total_searches_executed": 80
}
  • Low counts may indicate validation failures
  • Review LLM errors in logs
  • Verify research goal clarity

Know each channel's strengths:
  • Company websites: Direct evidence, high reliability
  • Jobs search: Technology validation, hiring signals
  • News search: Events, announcements, partnerships
  • External web: Third-party validation
  • Professional networks: Team expertise, growth

Troubleshooting

Too few strategies generated
Expected: 4-13 strategies based on depth. If receiving fewer than 4:
  • Check LLM response logs
  • Verify API key is valid
  • Review research goal clarity
  • Ensure no network issues
Solution: Add logging to see LLM response

Searches return little or no evidence
Possible causes:
  • Research goal doesn’t match reality
  • Companies don’t have public evidence
  • Queries too specific
Solutions:
  • Broaden research criteria
  • Try different search depth
  • Verify company domains are correct
  • Check sample strategies for quality

Strategies failing validation
Check for:
  • Invalid channel names
  • Missing placeholders
  • Malformed query templates
Solution: Review validation logs, adjust LLM temperature if needed

Next Steps

  • Running Research: Execute research with optimized strategies
  • Understanding Results: Interpret evidence and confidence scores
