
Overview

The GTM Research Engine collects evidence from multiple independent data sources in parallel, providing comprehensive intelligence about companies. Each source is optimized for specific types of information, creating a complete picture of a company’s technology stack and business activities.

Supported Data Sources

Google Search

Site-specific searches, file type filtering, Boolean queries via Tavily API

News Search

Press releases, funding news, partnerships via NewsAPI

Jobs Search

Job postings with TF-IDF semantic matching via Greenhouse API

How It Works

The research pipeline executes searches across all sources simultaneously, using async worker pools to maximize throughput while respecting rate limits.

Source Configuration

Each data source has its own semaphore pool for rate limiting:
```python
# From pipeline.py:44-48
self.source_pools: Dict[str, asyncio.Semaphore] = {
    "google_search": asyncio.Semaphore(max_parallel_searches),
    "jobs_search": asyncio.Semaphore(max_parallel_searches),
    "news_search": asyncio.Semaphore(max_parallel_searches),
}
```
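The pool caps in-flight requests per source: a call must acquire its source's semaphore before running, and queues otherwise. A minimal sketch of that pattern (the `run_bounded` helper and fake search function are illustrative, not pipeline code):

```python
import asyncio

async def run_bounded(pool, source_name, coro_fn, *args):
    # At most Semaphore(n) calls per source run concurrently;
    # the rest wait on the semaphore.
    async with pool[source_name]:
        return await coro_fn(*args)

async def main():
    pool = {"google_search": asyncio.Semaphore(2)}

    async def fake_search(query):
        await asyncio.sleep(0.01)  # stand-in for a network call
        return f"results for {query}"

    tasks = [run_bounded(pool, "google_search", fake_search, q)
             for q in ["a", "b", "c", "d"]]
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
```

`asyncio.gather` preserves submission order, so results line up with the input queries even though only two searches run at a time.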
Sources are initialized as reusable clients:
```python
# From pipeline.py:52-56
self.sources: Dict[str, BaseSource] = {
    "google_search": GoogleSearchSource(),
    "jobs_search": JobsSearchSource(),
    "news_search": NewsSearchSource(),
}
```

Query Execution

Queries are executed concurrently across all sources and domains:
```python
# From pipeline.py:145-152
tasks: List[asyncio.Task[Tuple[str, SourceResult]]] = []
for domain in self.company_domains:
    for strategy in self.strategies:
        tasks.append(
            asyncio.create_task(
                self._execute_one(domain, strategy, self.search_depth)
            )
        )
```

Source-Specific Features

Google Search (Tavily API)

Executes site-specific searches with configurable depth and result limits:
```python
# From google_search.py:25-35
SEARCH_DEPTH_MAPPING = {
    "quick": "basic",
    "standard": "advanced",
    "comprehensive": "advanced",
}

MAX_RESULTS_MAPPING = {
    "quick": 2,
    "standard": 3,
    "comprehensive": 5,
}
```
Supports advanced search patterns:
  • site:{DOMAIN} [keywords] - Search within company domain
  • site:{DOMAIN}/blog [keywords] - Target specific subdomains
  • site:{DOMAIN} filetype:pdf [keywords] - Find technical documentation
  • Boolean operators for precision queries
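These patterns can be composed mechanically from the domain and keywords. A small sketch (the `build_site_query` helper is hypothetical, shown only to illustrate the pattern):

```python
def build_site_query(domain, keywords, path="", filetype=None):
    # Compose a query like "site:stripe.com/blog filetype:pdf fraud detection".
    parts = [f"site:{domain}{path}"]
    if filetype:
        parts.append(f"filetype:{filetype}")
    parts.append(keywords)
    return " ".join(parts)

blog_query = build_site_query("stripe.com", "fraud detection", path="/blog")
pdf_query = build_site_query("stripe.com", "architecture", filetype="pdf")
```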

News Search (NewsAPI)

Searches news articles and press releases with company context:
```python
# From news_search.py:49-61
company_name = domain.split('.')[0]
search_query = f'"{company_name}" AND ({query})'

data = await asyncio.to_thread(
    self.client.get_everything,
    q=search_query,
    sort_by="relevancy",
    language="en",
    page_size=self.PAGE_SIZE_MAPPING.get(search_depth, 3),
)
```
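For example, the snippet above turns a domain of `stripe.com` and a query of `fraud detection AI` into the following NewsAPI query string:

```python
domain = "stripe.com"
query = "fraud detection AI"

company_name = domain.split('.')[0]               # "stripe"
search_query = f'"{company_name}" AND ({query})'  # '"stripe" AND (fraud detection AI)'
```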
Ideal for finding:
  • Funding announcements
  • Partnership press releases
  • Security incidents
  • Product launches

Jobs Search (Greenhouse API)

Uses TF-IDF vectorization for semantic matching of job descriptions:
```python
# From jobs_search.py:24-35
self._tfidf_vectorizer = TfidfVectorizer(
    max_features=2000,
    stop_words='english',
    ngram_range=(1, 3),  # Include trigrams for phrase matching
    lowercase=True,
    min_df=1,
    max_df=0.95,
    token_pattern=r'\b[a-zA-Z]+\b',
)
```
Scores job postings based on content similarity:
```python
# From jobs_search.py:73-82
similarities = cosine_similarity(search_vector, job_vectors)[0]

matches = []
for job, similarity in zip(jobs, similarities):
    if similarity >= threshold:
        matches.append((job, float(similarity)))

return sorted(matches, key=lambda x: x[1], reverse=True)
```
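Putting the two snippets together, a self-contained sketch of the matching flow (the job texts, query, and threshold here are illustrative; the real source configures its vectorizer with the settings shown earlier):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

jobs = [
    "Senior ML engineer building fraud detection models in Python",
    "Office manager coordinating travel and supplies",
]
search_text = "machine learning fraud detection"
threshold = 0.1

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
# Fit jobs and query together so they share one vocabulary.
matrix = vectorizer.fit_transform(jobs + [search_text])
job_vectors, search_vector = matrix[:-1], matrix[-1]

similarities = cosine_similarity(search_vector, job_vectors)[0]
matches = [(job, float(s)) for job, s in zip(jobs, similarities) if s >= threshold]
matches.sort(key=lambda m: m[1], reverse=True)
# Only the ML posting shares terms with the query and clears the threshold.
```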

Evidence Deduplication

Redis-based deduplication ensures unique evidence across all sources:
```python
# From google_search.py:71-82
is_cached = await asyncio.to_thread(
    redis_client.is_evidence_cached, domain, evidence
)
if is_cached:
    continue

await asyncio.to_thread(
    redis_client.add_evidence_to_cache, domain, evidence
)

evidences.append(evidence)
```
Deduplication happens at the evidence level, not the source level: if two sources return the same URL, the evidence is stored once, while every distinct finding from each source is kept.
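One common way to implement such a cache is a Redis set per domain whose members are URL hashes. The sketch below mimics that with an in-memory dict; the key scheme and method names are assumptions, not the project's actual `redis_client` API:

```python
import hashlib

class EvidenceCache:
    """Deduplicates evidence per domain. A real implementation would
    back this with redis SADD/SISMEMBER; a dict of sets stands in here."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _fingerprint(evidence):
        # Hash the URL so set members stay small and uniform.
        return hashlib.sha256(evidence["url"].encode()).hexdigest()

    def seen_before(self, domain, evidence):
        key = f"evidence:{domain}"  # hypothetical key scheme
        fp = self._fingerprint(evidence)
        members = self._store.setdefault(key, set())
        if fp in members:
            return True
        members.add(fp)
        return False

cache = EvidenceCache()
ev = {"url": "https://stripe.com/blog/fraud-detection"}
first = cache.seen_before("stripe.com", ev)   # False: newly cached
second = cache.seen_before("stripe.com", ev)  # True: duplicate skipped
```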

Making a Multi-Source Request

```bash
curl -X POST http://localhost:8000/research/batch \
  -H "Content-Type: application/json" \
  -d '{
    "research_goal": "Find fintech companies using AI for fraud detection",
    "company_domains": ["stripe.com", "paypal.com"],
    "search_depth": "standard",
    "max_parallel_searches": 20,
    "confidence_threshold": 0.7
  }'
```

Response Format

```json
{
  "research_id": "uuid",
  "results": [
    {
      "domain": "stripe.com",
      "confidence_score": 0.92,
      "evidence_sources": 3,
      "findings": {
        "technologies": ["TensorFlow", "Python", "Kubernetes"],
        "evidence": [
          {
            "url": "https://stripe.com/blog/fraud-detection",
            "title": "Building AI-Powered Fraud Detection",
            "snippet": "We use TensorFlow...",
            "source_name": "google_search"
          },
          {
            "url": "https://newsapi.org/article/123",
            "title": "Stripe Expands AI Capabilities",
            "snippet": "Partnership announcement...",
            "source_name": "news_search"
          }
        ],
        "signals_found": 8
      }
    }
  ]
}
```
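Once parsed, a caller can tally which sources contributed evidence for each domain. A minimal sketch against the response shape above (the `by_source` grouping is illustrative client-side code, not part of the API):

```python
response = {
    "results": [
        {
            "domain": "stripe.com",
            "findings": {
                "evidence": [
                    {"url": "https://stripe.com/blog/fraud-detection",
                     "source_name": "google_search"},
                    {"url": "https://newsapi.org/article/123",
                     "source_name": "news_search"},
                ]
            },
        }
    ]
}

# Group evidence URLs by the source that produced them, across all domains.
by_source = {}
for result in response["results"]:
    for ev in result["findings"]["evidence"]:
        by_source.setdefault(ev["source_name"], []).append(ev["url"])
```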

Configuration Options

Environment Variables

Configure API keys for each source:
```bash
GEMINI_API_KEY=your_gemini_api_key_here
TAVILY_API_KEY=your_tavily_api_key_here
NEWS_API_KEY=your_news_api_key_here
```
All three API keys are required for full multi-source functionality. A missing key causes the corresponding source to fail, while the rest of the pipeline continues.

Search Depth

Controls the number of results per source:
  • quick - 2 results per source (fastest, lowest cost)
  • standard - 3 results per source (balanced)
  • comprehensive - 5 results per source (most thorough)

Max Parallel Searches

Controls concurrency per source pool. Higher values increase speed but may hit rate limits:
```python
max_parallel_searches: int = 20  # Default recommended value
```

Best Practices

Start with standard depth and adjust based on result quality:
  • Use quick for large batch operations (100+ domains)
  • Use comprehensive for high-value targets requiring maximum evidence
Each source has different rate limits:
  • Tavily: Based on your plan tier
  • NewsAPI: 1,000 requests/day (free tier)
  • Greenhouse: No authentication required, but respect fair use
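When a source does return a rate-limit error, a simple exponential backoff wrapper keeps the batch moving. This helper is illustrative and not part of the pipeline; the flaky search function simulates a source that succeeds on its third attempt:

```python
import asyncio

async def with_backoff(coro_fn, *args, retries=3, base_delay=0.05):
    # Retry a failing call, doubling the wait between attempts.
    for attempt in range(retries):
        try:
            return await coro_fn(*args)
        except RuntimeError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            await asyncio.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}

async def flaky_search(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return f"ok: {query}"

result = asyncio.run(with_backoff(flaky_search, "fraud detection"))
```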
The max_parallel_searches setting affects all sources:
```yaml
# Conservative (slower, safer)
max_parallel_searches: 10

# Aggressive (faster, may hit limits)
max_parallel_searches: 50
```
The pipeline continues even if individual sources fail:
```python
# From pipeline.py:158-163
for coro in asyncio.as_completed(tasks):
    domain, res = await coro
    completed_count += 1

    if res.ok and res.evidences:
        domain_to_evidence[domain].extend(res.evidences)
```

Performance Metrics

Multi-source research provides 3-5x faster results compared to sequential execution:
```json
{
  "search_performance": {
    "queries_per_second": 12.5,
    "failed_requests": 2
  },
  "processing_time_ms": 3420
}
```

Parallel Processing

Learn how async worker pools maximize throughput

AI-Powered Analysis

Understand how evidence is analyzed with LLMs
