
Search Strategies

Search strategies are the core of the GTM Research Engine’s intelligence. The system uses LLM-powered strategy generation to create targeted searches across multiple data sources, maximizing evidence collection while respecting rate limits.

Overview

The QueryGenerator service converts research goals into 4-13 optimized search strategies, distributed across multiple channels based on search depth.
# From query_generation.py:17-21
class QueryGenerator:
    """LLM-powered query strategy generator using Google Gemini.

    Generates 8-12 diverse strategies across channels depending on search depth.
    """

Strategy Generation Process

Step 1: Receive Research Goal

The system receives a natural language research goal:
{
  "research_goal": "Find SaaS companies using Kubernetes in production",
  "search_depth": "standard"
}

Step 2: LLM Strategy Generation

Google Gemini 2.5 Flash analyzes the goal and generates optimized strategies:
# From query_generation.py:162-174
async def generate_strategies(
    self, research_goal: str, search_depth: str
) -> List[QueryStrategy]:
    try:
        strategies = await self._generate_with_llm(
            research_goal, search_depth
        )
        return strategies
    except Exception as e:
        print(f"LLM generation failed: {e}")
        return []

Step 3: Strategy Validation

Each strategy is validated for:
  • Required fields (channel, query_template)
  • Supported channels
  • Proper placeholder usage
  • Relevance score (0.0-1.0)

Step 4: Relevance Sorting

Strategies are sorted by relevance score for optimized execution:
# From query_generation.py:217-218
strategies.sort(key=lambda s: s.relevance_score, reverse=True)
return strategies
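The sort can be reproduced with a minimal stand-in for the strategy model (the dataclass below is a sketch for illustration; only the channel, query_template, and relevance_score fields are taken from this page):

```python
from dataclasses import dataclass

@dataclass
class QueryStrategy:
    # Minimal stand-in for the engine's strategy model
    channel: str
    query_template: str
    relevance_score: float = 1.0

strategies = [
    QueryStrategy("news_search", "{COMPANY_NAME} kubernetes migration", 0.85),
    QueryStrategy("google_search", "site:{DOMAIN}/blog kubernetes production", 0.95),
    QueryStrategy("google_search", "site:{DOMAIN} microservices", 0.42),
]

# Highest-relevance strategies execute first
strategies.sort(key=lambda s: s.relevance_score, reverse=True)
print([s.relevance_score for s in strategies])  # → [0.95, 0.85, 0.42]
```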

Supported Channels

The engine supports five distinct data source channels:

Search Depth Levels

Search depth controls the number and diversity of strategies generated:
{
  "search_depth": "quick",
  "expected_strategies": "4-6",
  "distribution": {
    "google_search": 2,
    "news_search": 1,
    "jobs_search": 1
  },
  "execution_time": "15-30s per 10 companies",
  "use_case": "Preliminary research, testing, rapid screening"
}

Query Templates & Placeholders

Strategies use placeholders that are dynamically substituted for each company:

Available Placeholders

{DOMAIN}: replaced with the full company domain (e.g., stripe.com). Usage:
# Template
"site:{DOMAIN} kubernetes production"

# Result for stripe.com
"site:stripe.com kubernetes production"
Common Patterns:
  • site:{DOMAIN} - Search within domain
  • site:{DOMAIN}/blog - Search company blog
  • site:{DOMAIN}/careers - Search job postings
  • site:{DOMAIN} filetype:pdf - Find PDFs
  • -site:{DOMAIN} - Exclude domain from results

{COMPANY_NAME}: replaced with the first label of the company domain (e.g., stripe for stripe.com).

Placeholder Substitution

# From search.py:14-24
def build_query(self, company_domain: str) -> str:
    # jobs_search doesn't use domain/company placeholders
    if self.channel == "jobs_search":
        return self.query_template
        
    # Other channels use domain/company substitution
    company_name = company_domain.split(".")[0]
    
    return self.query_template.format(
        DOMAIN=company_domain, COMPANY_NAME=company_name
    )
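A self-contained sketch of the substitution above (the dataclass is a stand-in for the engine's real strategy model, reduced to the two fields build_query needs):

```python
from dataclasses import dataclass

@dataclass
class QueryStrategy:
    channel: str
    query_template: str

    def build_query(self, company_domain: str) -> str:
        # jobs_search templates carry no placeholders to fill
        if self.channel == "jobs_search":
            return self.query_template
        # First label of the domain serves as the company name
        company_name = company_domain.split(".")[0]
        return self.query_template.format(
            DOMAIN=company_domain, COMPANY_NAME=company_name
        )

blog = QueryStrategy("google_search", "site:{DOMAIN}/blog kubernetes production")
jobs = QueryStrategy("jobs_search", "kubernetes devops engineer")

print(blog.build_query("stripe.com"))  # → site:stripe.com/blog kubernetes production
print(jobs.build_query("stripe.com"))  # → kubernetes devops engineer
```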

Relevance Scoring

Each strategy includes a relevance score (0.0-1.0) indicating expected evidence quality:
High relevance (0.90-1.00): direct, specific searches most likely to find strong evidence. Examples:
  • site:{DOMAIN}/blog kubernetes production (0.95)
  • site:{DOMAIN}/careers kubernetes engineer (0.93)
  • site:{DOMAIN} filetype:pdf kubernetes architecture (0.91)
These strategies target exact locations where evidence is likely to exist.
Strong relevance (0.80-0.89): targeted searches with a good probability of relevant results. Examples:
  • {COMPANY_NAME} AND kubernetes migration news (0.85)
  • Jobs search: "kubernetes devops engineer" (0.88)
  • site:linkedin.com/company/{COMPANY_NAME} kubernetes (0.80)
These strategies have a clear connection to the research goal.
Moderate relevance (0.60-0.69): broader searches that may find supporting evidence. Examples:
  • "{COMPANY_NAME}" case study kubernetes -site:{DOMAIN} (0.65)
  • site:{DOMAIN} container orchestration (0.60)
  • News search: {COMPANY_NAME} cloud infrastructure (0.63)
These strategies cast a wider net for corroborating evidence.
Low relevance (0.40-0.49): exploratory searches for edge cases or indirect signals. Examples:
  • "{COMPANY_NAME}" conference presentation cloud (0.45)
  • site:{DOMAIN} microservices (0.42)
  • External: "{COMPANY_NAME}" technology stack (0.40)
These strategies explore tangential evidence.

Scoring Factors

The LLM considers multiple factors when assigning relevance scores:
# From query_generation.py:115-120 (system instruction)
"""Score each strategy based on:
1. Specificity: More specific queries score higher than broad ones
2. Evidence Quality: Strategies likely to find direct evidence score higher
3. Source Reliability: Company websites and job postings often score higher than general web searches
4. Keyword Relevance: Strategies using exact research goal keywords score higher"""
Strategies are executed in descending relevance order, prioritizing high-value searches first.

Strategy Validation

All generated strategies undergo validation before execution:
# From query_generation.py:222-240
def _is_valid_strategy(self, strategy_data: dict, required_fields: set, supported_channels: set) -> bool:
    # Fast set membership checks
    if not required_fields.issubset(strategy_data.keys()):
        return False
    
    if strategy_data["channel"] not in supported_channels:
        return False
    
    # jobs_search doesn't need domain/company placeholders
    channel = strategy_data["channel"]
    template = strategy_data["query_template"]
    
    if channel == "jobs_search":
        return True  # Any template is valid for jobs_search
    else:
        # Other channels require domain/company placeholders
        return "{DOMAIN}" in template or "{COMPANY_NAME}" in template
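The same checks can be exercised with a free-function version (a sketch mirroring the method above, with the required fields and channels from this page hard-coded):

```python
REQUIRED_FIELDS = {"channel", "query_template"}
SUPPORTED_CHANNELS = {"google_search", "news_search", "jobs_search"}

def is_valid_strategy(strategy_data: dict) -> bool:
    # All required fields must be present
    if not REQUIRED_FIELDS.issubset(strategy_data.keys()):
        return False
    # Channel must be one the engine supports
    if strategy_data["channel"] not in SUPPORTED_CHANNELS:
        return False
    # jobs_search templates need no placeholders
    if strategy_data["channel"] == "jobs_search":
        return True
    # Other channels must reference the company somehow
    template = strategy_data["query_template"]
    return "{DOMAIN}" in template or "{COMPANY_NAME}" in template

print(is_valid_strategy(
    {"channel": "google_search", "query_template": "site:{DOMAIN} kubernetes"}
))  # → True
print(is_valid_strategy(
    {"channel": "google_search", "query_template": "kubernetes production"}
))  # → False (no placeholder)
print(is_valid_strategy(
    {"channel": "web_scrape", "query_template": "site:{DOMAIN}"}
))  # → False (unsupported channel)
```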
Validation Checks:
1. Required Fields

  • channel: Must be present
  • query_template: Must be present
  • relevance_score: Optional (defaults to 1.0)

2. Supported Channels

Channel must be one of:
  • google_search
  • news_search
  • jobs_search

3. Placeholder Requirements

  • Jobs search: No placeholders required
  • Other channels: Must include {DOMAIN} OR {COMPANY_NAME}
Invalid strategies are silently filtered out. Monitor generated strategy counts to detect validation issues.

Example Strategy Sets

Here are real-world examples of generated strategy sets:

Example 1: Kubernetes in Production

{
  "research_goal": "Find companies using Kubernetes in production",
  "search_depth": "standard"
}
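A standard-depth run for this goal might yield a set like the following. This output is illustrative, not captured from a live run; the templates are drawn from the scoring examples earlier on this page:

```json
{
  "search_strategies_generated": 8,
  "strategies": [
    {"channel": "google_search", "query_template": "site:{DOMAIN}/blog kubernetes production", "relevance_score": 0.95},
    {"channel": "google_search", "query_template": "site:{DOMAIN}/careers kubernetes engineer", "relevance_score": 0.93},
    {"channel": "jobs_search", "query_template": "kubernetes devops engineer", "relevance_score": 0.88},
    {"channel": "news_search", "query_template": "{COMPANY_NAME} kubernetes migration", "relevance_score": 0.85},
    {"channel": "google_search", "query_template": "site:{DOMAIN} filetype:pdf kubernetes architecture", "relevance_score": 0.82},
    {"channel": "news_search", "query_template": "{COMPANY_NAME} cloud infrastructure", "relevance_score": 0.63},
    {"channel": "google_search", "query_template": "site:{DOMAIN} container orchestration", "relevance_score": 0.60},
    {"channel": "google_search", "query_template": "\"{COMPANY_NAME}\" case study kubernetes -site:{DOMAIN}", "relevance_score": 0.58}
  ]
}
```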

Example 2: AI for Fraud Detection

{
  "research_goal": "Find fintech companies using AI for fraud detection",
  "search_depth": "comprehensive"
}

Strategy Execution

Once generated, strategies are executed in parallel across all companies:
# From pipeline.py:145-152
tasks: List[asyncio.Task[Tuple[str, SourceResult]]] = []
for domain in self.company_domains:
    for strategy in self.strategies:
        tasks.append(
            asyncio.create_task(
                self._execute_one(domain, strategy, self.search_depth)
            )
        )
Execution Flow:
1. Task Creation

Create async task for each domain × strategy combination:
  • 10 companies × 8 strategies = 80 parallel tasks

2. Rate Limiting

Tasks acquire semaphore before executing:
async with source_pool:
    result = await source.fetch(domain, query, search_depth)

3. Circuit Breaker Check

Verify circuit breaker allows request:
if not breaker.allow_request():
    return SourceResult(ok=False, error="circuit open")

4. Evidence Collection

Execute search and collect evidence:
  • Success: Record metrics, update breaker
  • Failure: Log error, increment failure count
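The fan-out pattern can be sketched end to end with a stubbed fetch. Here the semaphore stands in for the source pool, and execute_one replaces the engine's _execute_one with a mock; MAX_PARALLEL is an assumed setting, not a documented default:

```python
import asyncio

MAX_PARALLEL = 5  # stand-in for max_parallel_searches
domains = [f"company{i}.com" for i in range(10)]
strategies = ["site:{DOMAIN} kubernetes", "{COMPANY_NAME} kubernetes news"]

async def execute_one(pool, domain, strategy):
    async with pool:               # rate limit: at most MAX_PARALLEL in flight
        await asyncio.sleep(0.01)  # stub for the real source fetch
        return (domain, strategy, True)

async def run_all():
    pool = asyncio.Semaphore(MAX_PARALLEL)
    # One task per domain × strategy combination
    tasks = [
        asyncio.create_task(execute_one(pool, d, s))
        for d in domains
        for s in strategies
    ]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all())
print(len(results))  # → 20 (10 domains × 2 strategies)
```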

Customizing Strategies

While strategies are auto-generated, you can influence them through:

Research Goal Specificity

{
  "research_goal": "Companies using cloud technologies"
}
Result: Generic strategies across many cloud services
  • Less targeted searches
  • More diverse results
  • Lower precision
More specific research goals produce more targeted strategies with higher relevance scores.

Search Depth Selection

# Quick: Focus on highest-yield sources
{
  "search_depth": "quick",
  # Generates: Company website, jobs, news (4-6 strategies)
}

# Standard: Balanced coverage
{
  "search_depth": "standard",
  # Generates: All sources, moderate depth (7-10 strategies)
}

# Comprehensive: Maximum evidence
{
  "search_depth": "comprehensive",
  # Generates: All sources, maximum depth (11-13 strategies)
}

Performance Considerations

Strategy Count vs. Execution Time

Search Depth    Strategies   10 Companies   50 Companies   100 Companies
Quick           4-6          15-30s         45-90s         90-180s
Standard        7-10         30-60s         90-180s        180-360s
Comprehensive   11-13        60-120s        180-360s       360-720s
Actual times vary based on:
  • max_parallel_searches setting
  • Network latency
  • Source API response times
  • Rate limiting
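A back-of-the-envelope estimate consistent with the table above. Both max_parallel and avg_latency_s are assumed values for illustration, not measured defaults:

```python
def estimate_runtime_s(companies: int, strategies: int,
                       max_parallel: int = 10, avg_latency_s: float = 2.0) -> float:
    # Total searches fan out as companies × strategies,
    # executed in waves of max_parallel concurrent requests.
    total_searches = companies * strategies
    waves = -(-total_searches // max_parallel)  # ceiling division
    return waves * avg_latency_s

print(estimate_runtime_s(10, 8))    # → 16.0  (80 searches in 8 waves)
print(estimate_runtime_s(100, 12))  # → 240.0 (1200 searches in 120 waves)
```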

LLM Generation Time

Strategy generation typically takes 2-5 seconds:
# From routes.py:28-35
start_time = time.time()
query_generator = QueryGenerator()
strategies = await query_generator.generate_strategies(
    research_goal=payload.research_goal,
    search_depth=payload.search_depth,
)
end_time = time.time()
print(f"Query generation time: {end_time - start_time} seconds")
Strategy generation happens once per research request, regardless of company count.

Best Practices

Write specific research goals.
Good:
  • “Find SaaS companies using React and Node.js”
  • “Healthcare companies implementing FHIR standards”
  • “Fintech startups with Series A funding using machine learning”
Avoid:
  • “Find good companies” (too vague)
  • “Companies” (no criteria)
  • “Tech startups in SF” (missing technology focus)

Choose search depth by use case:
  • Testing/Development: Use quick
  • Production Research: Use standard
  • Due Diligence: Use comprehensive
  • Large Batches: Consider quick or standard

Monitor strategy generation: check the search_strategies_generated field in responses:
{
  "search_strategies_generated": 8,
  "total_searches_executed": 80
}
  • Low counts may indicate validation failures
  • Review LLM errors in logs
  • Verify research goal clarity

Know each channel's strengths:
  • Company websites: Direct evidence, high reliability
  • Jobs search: Technology validation, hiring signals
  • News search: Events, announcements, partnerships
  • External web: Third-party validation
  • Professional networks: Team expertise, growth

Troubleshooting

Too few strategies generated
Expected: 4-13 strategies based on depth. If receiving fewer than 4:
  • Check LLM response logs
  • Verify API key is valid
  • Review research goal clarity
  • Ensure no network issues
Solution: Add logging to see LLM response

Searches return little or no evidence
Possible causes:
  • Research goal doesn’t match reality
  • Companies don’t have public evidence
  • Queries too specific
Solutions:
  • Broaden research criteria
  • Try different search depth
  • Verify company domains are correct
  • Check sample strategies for quality

Strategies failing validation
Check for:
  • Invalid channel names
  • Missing placeholders
  • Malformed query templates
Solution: Review validation logs, adjust LLM temperature if needed

Next Steps

  • Running Research: Execute research with optimized strategies
  • Understanding Results: Interpret evidence and confidence scores
