
Overview

The GTM Research Engine collects evidence from multiple independent data sources in parallel, providing comprehensive intelligence about companies. Each source is optimized for specific types of information, creating a complete picture of a company’s technology stack and business activities.

Supported Data Sources

Google Search

Site-specific searches, file type filtering, Boolean queries via Tavily API

News Search

Press releases, funding news, partnerships via NewsAPI

Jobs Search

Job postings with TF-IDF semantic matching via Greenhouse API

How It Works

The research pipeline executes searches across all sources simultaneously, using async worker pools to maximize throughput while respecting rate limits.

Source Configuration

Each data source has its own semaphore pool for rate limiting:
```python
# From pipeline.py:44-48
self.source_pools: Dict[str, asyncio.Semaphore] = {
    "google_search": asyncio.Semaphore(max_parallel_searches),
    "jobs_search": asyncio.Semaphore(max_parallel_searches),
    "news_search": asyncio.Semaphore(max_parallel_searches),
}
```
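The pool caps in-flight requests per source: a call must acquire its source's semaphore before running, and queues otherwise. A minimal sketch of that pattern (the `run_bounded` helper and fake search function are illustrative, not pipeline code):

```python
import asyncio

async def run_bounded(pool, source_name, coro_fn, *args):
    # At most Semaphore(n) calls per source run concurrently;
    # the rest wait on the semaphore.
    async with pool[source_name]:
        return await coro_fn(*args)

async def main():
    pool = {"google_search": asyncio.Semaphore(2)}

    async def fake_search(query):
        await asyncio.sleep(0.01)  # stand-in for a network call
        return f"results for {query}"

    tasks = [run_bounded(pool, "google_search", fake_search, q)
             for q in ["a", "b", "c", "d"]]
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
```

`asyncio.gather` preserves submission order, so results line up with the input queries even though only two searches run at a time.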
Sources are initialized as reusable clients:
```python
# From pipeline.py:52-56
self.sources: Dict[str, BaseSource] = {
    "google_search": GoogleSearchSource(),
    "jobs_search": JobsSearchSource(),
    "news_search": NewsSearchSource(),
}
```

Query Execution

Queries are executed concurrently across all sources and domains:
```python
# From pipeline.py:145-152
tasks: List[asyncio.Task[Tuple[str, SourceResult]]] = []
for domain in self.company_domains:
    for strategy in self.strategies:
        tasks.append(
            asyncio.create_task(
                self._execute_one(domain, strategy, self.search_depth)
            )
        )
```

Source-Specific Features

Google Search (Tavily API)

Executes site-specific searches with configurable depth and result limits:
```python
# From google_search.py:25-35
SEARCH_DEPTH_MAPPING = {
    "quick": "basic",
    "standard": "advanced",
    "comprehensive": "advanced",
}

MAX_RESULTS_MAPPING = {
    "quick": 2,
    "standard": 3,
    "comprehensive": 5,
}
```
Supports advanced search patterns:
  • site:{DOMAIN} [keywords] - Search within company domain
  • site:{DOMAIN}/blog [keywords] - Target specific subdomains
  • site:{DOMAIN} filetype:pdf [keywords] - Find technical documentation
  • Boolean operators for precision queries
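These patterns can be composed mechanically from the domain and keywords. A small sketch (the `build_site_query` helper is hypothetical, shown only to illustrate the pattern):

```python
def build_site_query(domain, keywords, path="", filetype=None):
    # Compose a query like "site:stripe.com/blog filetype:pdf fraud detection".
    parts = [f"site:{domain}{path}"]
    if filetype:
        parts.append(f"filetype:{filetype}")
    parts.append(keywords)
    return " ".join(parts)

blog_query = build_site_query("stripe.com", "fraud detection", path="/blog")
pdf_query = build_site_query("stripe.com", "architecture", filetype="pdf")
```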

News Search (NewsAPI)

Searches news articles and press releases with company context:
```python
# From news_search.py:49-61
company_name = domain.split('.')[0]
search_query = f'"{company_name}" AND ({query})'

data = await asyncio.to_thread(
    self.client.get_everything,
    q=search_query,
    sort_by="relevancy",
    language="en",
    page_size=self.PAGE_SIZE_MAPPING.get(search_depth, 3),
)
```
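For example, the snippet above turns a domain of `stripe.com` and a query of `fraud detection AI` into the following NewsAPI query string:

```python
domain = "stripe.com"
query = "fraud detection AI"

company_name = domain.split('.')[0]               # "stripe"
search_query = f'"{company_name}" AND ({query})'  # '"stripe" AND (fraud detection AI)'
```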
Ideal for finding:
  • Funding announcements
  • Partnership press releases
  • Security incidents
  • Product launches

Jobs Search (Greenhouse API)

Uses TF-IDF vectorization for semantic matching of job descriptions:
```python
# From jobs_search.py:24-35
self._tfidf_vectorizer = TfidfVectorizer(
    max_features=2000,
    stop_words='english',
    ngram_range=(1, 3),  # Include trigrams for phrase matching
    lowercase=True,
    min_df=1,
    max_df=0.95,
    token_pattern=r'\b[a-zA-Z]+\b',
)
```
Scores job postings based on content similarity:
```python
# From jobs_search.py:73-82
similarities = cosine_similarity(search_vector, job_vectors)[0]

matches = []
for job, similarity in zip(jobs, similarities):
    if similarity >= threshold:
        matches.append((job, float(similarity)))

return sorted(matches, key=lambda x: x[1], reverse=True)
```
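Putting the two snippets together, a self-contained sketch of the matching flow (the job texts, query, and threshold here are illustrative; the real source configures its vectorizer with the settings shown earlier):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

jobs = [
    "Senior ML engineer building fraud detection models in Python",
    "Office manager coordinating travel and supplies",
]
search_text = "machine learning fraud detection"
threshold = 0.1

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
# Fit jobs and query together so they share one vocabulary.
matrix = vectorizer.fit_transform(jobs + [search_text])
job_vectors, search_vector = matrix[:-1], matrix[-1]

similarities = cosine_similarity(search_vector, job_vectors)[0]
matches = [(job, float(s)) for job, s in zip(jobs, similarities) if s >= threshold]
matches.sort(key=lambda m: m[1], reverse=True)
# Only the ML posting shares terms with the query and clears the threshold.
```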

Evidence Deduplication

Redis-based deduplication ensures unique evidence across all sources:
```python
# From google_search.py:71-82
is_cached = await asyncio.to_thread(
    redis_client.is_evidence_cached, domain, evidence
)
if is_cached:
    continue

await asyncio.to_thread(
    redis_client.add_evidence_to_cache, domain, evidence
)

evidences.append(evidence)
```
Deduplication happens at the evidence level, not the source level: if two sources return the same URL, the evidence is stored once, while every distinct finding from each source is kept.
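One common way to implement such a cache is a Redis set per domain whose members are URL hashes. The sketch below mimics that with an in-memory dict; the key scheme and method names are assumptions, not the project's actual `redis_client` API:

```python
import hashlib

class EvidenceCache:
    """Deduplicates evidence per domain. A real implementation would
    back this with redis SADD/SISMEMBER; a dict of sets stands in here."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _fingerprint(evidence):
        # Hash the URL so set members stay small and uniform.
        return hashlib.sha256(evidence["url"].encode()).hexdigest()

    def seen_before(self, domain, evidence):
        key = f"evidence:{domain}"  # hypothetical key scheme
        fp = self._fingerprint(evidence)
        members = self._store.setdefault(key, set())
        if fp in members:
            return True
        members.add(fp)
        return False

cache = EvidenceCache()
ev = {"url": "https://stripe.com/blog/fraud-detection"}
first = cache.seen_before("stripe.com", ev)   # False: newly cached
second = cache.seen_before("stripe.com", ev)  # True: duplicate skipped
```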

Making a Multi-Source Request

```bash
curl -X POST http://localhost:8000/research/batch \
  -H "Content-Type: application/json" \
  -d '{
    "research_goal": "Find fintech companies using AI for fraud detection",
    "company_domains": ["stripe.com", "paypal.com"],
    "search_depth": "standard",
    "max_parallel_searches": 20,
    "confidence_threshold": 0.7
  }'
```

Response Format

```json
{
  "research_id": "uuid",
  "results": [
    {
      "domain": "stripe.com",
      "confidence_score": 0.92,
      "evidence_sources": 3,
      "findings": {
        "technologies": ["TensorFlow", "Python", "Kubernetes"],
        "evidence": [
          {
            "url": "https://stripe.com/blog/fraud-detection",
            "title": "Building AI-Powered Fraud Detection",
            "snippet": "We use TensorFlow...",
            "source_name": "google_search"
          },
          {
            "url": "https://newsapi.org/article/123",
            "title": "Stripe Expands AI Capabilities",
            "snippet": "Partnership announcement...",
            "source_name": "news_search"
          }
        ],
        "signals_found": 8
      }
    }
  ]
}
```
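Once parsed, a caller can tally which sources contributed evidence for each domain. A minimal sketch against the response shape above (the `by_source` grouping is illustrative client-side code, not part of the API):

```python
response = {
    "results": [
        {
            "domain": "stripe.com",
            "findings": {
                "evidence": [
                    {"url": "https://stripe.com/blog/fraud-detection",
                     "source_name": "google_search"},
                    {"url": "https://newsapi.org/article/123",
                     "source_name": "news_search"},
                ]
            },
        }
    ]
}

# Group evidence URLs by the source that produced them, across all domains.
by_source = {}
for result in response["results"]:
    for ev in result["findings"]["evidence"]:
        by_source.setdefault(ev["source_name"], []).append(ev["url"])
```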

Configuration Options

Environment Variables

Configure API keys for each source:
```bash
GEMINI_API_KEY=your_gemini_api_key_here
TAVILY_API_KEY=your_tavily_api_key_here
NEWS_API_KEY=your_news_api_key_here
```
All three API keys are required for full multi-source functionality. A missing key causes the corresponding source to fail, while the rest of the pipeline continues.

Search Depth

Controls the number of results per source:
  • quick - 2 results per source (fastest, lowest cost)
  • standard - 3 results per source (balanced)
  • comprehensive - 5 results per source (most thorough)

Max Parallel Searches

Controls concurrency per source pool. Higher values increase speed but may hit rate limits:
```python
max_parallel_searches: int = 20  # Default recommended value
```

Best Practices

Start with standard depth and adjust based on result quality:
  • Use quick for large batch operations (100+ domains)
  • Use comprehensive for high-value targets requiring maximum evidence
Each source has different rate limits:
  • Tavily: Based on your plan tier
  • NewsAPI: 1,000 requests/day (free tier)
  • Greenhouse: No authentication required, but respect fair use
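When a source does return a rate-limit error, a simple exponential backoff wrapper keeps the batch moving. This helper is illustrative and not part of the pipeline; the flaky search function simulates a source that succeeds on its third attempt:

```python
import asyncio

async def with_backoff(coro_fn, *args, retries=3, base_delay=0.05):
    # Retry a failing call, doubling the wait between attempts.
    for attempt in range(retries):
        try:
            return await coro_fn(*args)
        except RuntimeError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            await asyncio.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}

async def flaky_search(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return f"ok: {query}"

result = asyncio.run(with_backoff(flaky_search, "fraud detection"))
```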
The max_parallel_searches setting affects all sources:
```yaml
# Conservative (slower, safer)
max_parallel_searches: 10

# Aggressive (faster, may hit limits)
max_parallel_searches: 50
```
The pipeline continues even if individual sources fail:
```python
# From pipeline.py:158-163
for coro in asyncio.as_completed(tasks):
    domain, res = await coro
    completed_count += 1

    if res.ok and res.evidences:
        domain_to_evidence[domain].extend(res.evidences)
```

Performance Metrics

Multi-source research provides 3-5x faster results compared to sequential execution:
```json
{
  "search_performance": {
    "queries_per_second": 12.5,
    "failed_requests": 2
  },
  "processing_time_ms": 3420
}
```

Parallel Processing

Learn how async worker pools maximize throughput

AI-Powered Analysis

Understand how evidence is analyzed with LLMs
