
Overview

The Lead Intelligence Engine prevents duplicate CRM entries by checking whether a business URL already exists in Coda before inserting new records. This keeps your CRM clean and prevents duplicate rows for already-qualified leads.
The duplicate check runs after AI evaluation but before CRM insertion, so duplicates still consume Groq tokens for analysis but are never written to the CRM. To skip the AI call entirely, pre-filter with fetch_row_by_url() as shown in Batch Deduplication below.

How It Works

1. AI Evaluation Completes
The engine extracts content, retrieves RAG context, and gets AI analysis.
2. Duplicate Check
Calls CodaClient.fetch_row_by_url() to search Coda by Business URL.
3. Decision
  • If duplicate found: Return {"_status": "skipped", "_message": "Duplicate found in CRM"}
  • If new: Proceed to CodaClient.insert_row()
core.py (lines 56-61)
# Check for duplicates first
if self.coda.fetch_row_by_url(url):
    result["_status"] = "skipped"
    result["_message"] = "Duplicate found in CRM"
    return result

self.coda.insert_row(result)

Implementation Details

Coda Search Query

The fetch_row_by_url() method uses Coda’s search API:
coda_client.py (lines 38-62)
def fetch_row_by_url(self, url):
    """Checks if a row with the given Business URL already exists."""
    # Properly quote the URL for the query
    query = f'"{url}"'
    api_url = f"https://coda.io/apis/v1/docs/{self.doc_id}/tables/{self.table_id}/rows"
    params = {
        "query": f'Business URL:{query}',
        "limit": 1
    }

    response = requests.get(api_url, headers=self._get_headers(), params=params, timeout=10)
    response.raise_for_status()
    items = response.json().get('items', [])
    return len(items) > 0
  • query (string): Coda query format: column_name:"value". Example: Business URL:"https://example.com"
  • limit (integer, default: 1): Only fetch 1 row since we just need to know whether any match exists.
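Because requests URL-encodes query parameters before sending them, you can preview the wire format of this query with the standard library. A quick sketch (the URL is illustrative):

```python
from urllib.parse import urlencode

url = "https://example.com"
params = {"query": f'Business URL:"{url}"', "limit": 1}

# urlencode percent-escapes the colon, quotes, and slashes,
# matching what is sent on the wire
encoded = urlencode(params)
print(encoded)
```

The space in the column name becomes `+`, and the quotes around the value become `%22`, so the query still reaches Coda as `Business URL:"https://example.com"`.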

URL Matching Logic

The system performs exact string matching on the full URL:
https://example.com       ≠ http://example.com        (scheme differs)
https://example.com       ≠ https://example.com/       (trailing slash)
https://www.example.com   ≠ https://example.com        (subdomain differs)
URL matching is case-sensitive and character-exact. HTTPS://Example.com and https://example.com are treated as different URLs.

Edge Cases

URL Normalization

The engine does NOT normalize URLs before checking duplicates:
Scenario: User analyzes both:
  • https://example.com
  • https://example.com/
Result: Both are inserted as separate records.
Mitigation: Train users to use consistent URL formats, or strip the trailing slash before processing:
url = url.rstrip('/')  # Remove trailing slash
Scenario: User analyzes both:
  • https://example.com
  • https://www.example.com
Result: Both inserted (even if they resolve to the same site).
Mitigation: Implement URL canonicalization:
from urllib.parse import urlparse, urlunparse
parsed = urlparse(url)
if parsed.hostname and parsed.hostname.startswith('www.'):
    # Strip the prefix from the host only, not from the rest of the URL
    url = urlunparse(parsed._replace(netloc=parsed.netloc.replace('www.', '', 1)))
Scenario: User analyzes:
  • http://example.com
  • https://example.com
Result: Both inserted as different businesses.
Mitigation: Force HTTPS:
if url.startswith('http://'):
    url = 'https://' + url[len('http://'):]
Scenario: User analyzes:
  • https://example.com
  • https://example.com?utm_source=facebook
Result: Both inserted (query params are treated as part of the URL).
Mitigation: Strip query parameters:
from urllib.parse import urlparse, urlunparse
parsed = urlparse(url)
url = urlunparse(parsed._replace(query='', fragment=''))
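These mitigations can be combined into a single helper applied before the duplicate check. A sketch (normalize_url is a hypothetical helper, not part of the engine; it also lowercases the host, which is safe because DNS hostnames are case-insensitive):

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    """Force https, lowercase the host, strip a leading 'www.',
    and drop trailing slash, query string, and fragment."""
    parsed = urlparse(url.strip())
    netloc = parsed.netloc.lower()
    if netloc.startswith('www.'):
        netloc = netloc[4:]
    path = parsed.path.rstrip('/')
    return urlunparse(('https', netloc, path, '', '', ''))
```

With this helper, normalize_url('http://www.example.com/?utm_source=facebook') and normalize_url('https://example.com/') both produce 'https://example.com', so only one CRM row is created.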

Performance

API Latency

Coda search typically takes:
  • Average: 500-1,000ms
  • 95th percentile: 1,500ms
  • Timeout: 10s (configured)
The duplicate check adds ~1s to total pipeline latency, but keeps the CRM clean. To also save Groq tokens, check for duplicates before the AI call (see Batch Deduplication).
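These latency figures can be spot-checked in your own environment with a small timing wrapper (a generic helper, not part of the engine):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_ms)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

For example, `is_dup, ms = timed(engine.coda.fetch_row_by_url, url)` (assuming an engine instance as in the batch script below) reports how long a single duplicate check takes against your Coda doc.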

Coda API Limits

  • Rate Limit: 100 requests per minute per API token
  • Concurrency: Up to 10 concurrent requests
For high-volume batch processing (>100 URLs/min), implement request throttling:
import time
from threading import Lock

class RateLimiter:
    def __init__(self, max_per_minute=100):
        self.max_per_minute = max_per_minute
        self.requests = []
        self.lock = Lock()
    
    def wait_if_needed(self):
        with self.lock:
            now = time.time()
            # Drop timestamps older than one minute
            self.requests = [t for t in self.requests if now - t < 60]

            if len(self.requests) >= self.max_per_minute:
                sleep_time = 60 - (now - self.requests[0])
                time.sleep(sleep_time)
                # Re-check the window after sleeping
                now = time.time()
                self.requests = [t for t in self.requests if now - t < 60]

            self.requests.append(now)

Error Handling

The system gracefully handles duplicate check failures:
coda_client.py (lines 60-62)
except Exception as e:
    logger.warning(f"Duplicate check failed (Coda API): {e}")
    return False  # Assume not duplicate if check fails
If the duplicate check fails (network error, Coda API down), the engine assumes no duplicate and proceeds with insertion. This prevents valid leads from being lost due to transient errors.
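If transient failures are common, the check can be retried with backoff before falling back to "not a duplicate". A sketch (check_duplicate_with_retry and its parameters are illustrative, not part of the engine):

```python
import time

def check_duplicate_with_retry(check, url, attempts=3, base_delay=1.0):
    """Call a duplicate-check function, retrying failures with
    exponential backoff; assume not-duplicate if every attempt fails."""
    for attempt in range(attempts):
        try:
            return check(url)
        except Exception:
            if attempt < attempts - 1:
                time.sleep(base_delay * 2 ** attempt)
    return False
```

A caller would pass the real check, e.g. `check_duplicate_with_retry(self.coda.fetch_row_by_url, url)`, keeping the same fail-open behavior as the engine while riding out brief Coda outages.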

Common Errors

  • Cause: Invalid CODA_API_TOKEN in .env. Fix: Regenerate the token at coda.io/account.
  • Cause: Invalid CODA_DOC_ID or CODA_TABLE_ID. Fix: Get the correct IDs from the Coda table URL:
    https://coda.io/d/<DOC_ID>/table/<TABLE_ID>
  • Cause: Coda API slow or unresponsive. Behavior: After the 10s timeout, the duplicate check returns False (not a duplicate). Fix: Retry the URL later, or increase the timeout in coda_client.py.

Monitoring Duplicates

CLI Output

When a duplicate is detected:
python main.py https://example.com

Analyzing https://example.com...

--- Evaluation Result ---
{
  "business_name": "Example Business",
  "business_type": "E-commerce",
  "primary_service": "Foundation Package",
  "fit_score": 82,
  "reasoning": "Small online store needs website."
}
------------------------

SKIPPED CRM INSERTION: https://example.com
Reason: Duplicate found in CRM
The AI evaluation still runs and displays results, but no CRM insertion occurs. This lets you verify the analysis even for duplicates.

Telegram Bot Output

The bot displays a different message for duplicates:
URL already exists in CRM:
https://example.com

Duplicate found in CRM

Manual Duplicate Resolution

If you need to re-analyze an existing lead:
1. Delete from Coda
Manually delete the row in your Coda table.
2. Re-run Analysis
python main.py https://example.com
The URL will now be treated as new and inserted.
If you want to update an existing lead without deleting, use Coda’s API to update the row directly instead of re-analyzing through the engine.
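Such an update can be sketched against Coda's rows endpoint (the payload shape follows Coda's public REST API; update_coda_row is a hypothetical helper, not part of CodaClient):

```python
import json
import urllib.request

def build_row_update(cells):
    """Build the rows-API payload: {"row": {"cells": [...]}}."""
    return {"row": {"cells": [{"column": c, "value": v}
                              for c, v in cells.items()]}}

def update_coda_row(token, doc_id, table_id, row_id, cells, timeout=10):
    """PUT an in-place update to an existing Coda row."""
    req = urllib.request.Request(
        f"https://coda.io/apis/v1/docs/{doc_id}/tables/{table_id}/rows/{row_id}",
        data=json.dumps(build_row_update(cells)).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

For example, refreshing a lead's score would pass `cells={"Fit Score": 90}` along with the row ID from a prior fetch, leaving the Business URL column (and thus duplicate detection) untouched.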

Batch Deduplication

For processing large URL lists with duplicates:
# Run each URL through the engine; report the ones that were not skipped
cat urls.txt | while read url; do
  python main.py "$url" 2>&1 | grep -q "SKIPPED" || echo "$url processed"
done
Or use Python for pre-filtering:
batch_process.py
from core import LeadEngine

engine = LeadEngine()
urls = open('urls.txt').read().splitlines()

for url in urls:
    try:
        # Check duplicate first (no AI call)
        if engine.coda.fetch_row_by_url(url):
            print(f"SKIP: {url} (duplicate)")
            continue
        
        # Process new URLs
        result = engine.process_url(url)
        print(f"SUCCESS: {url}")
    except Exception as e:
        print(f"ERROR: {url} - {e}")
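If the input file itself contains repeats, they can be removed locally before any API calls at all (plain Python, no Coda dependency):

```python
def dedupe_preserving_order(urls):
    """Drop blank lines and exact repeats while keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

Passing `dedupe_preserving_order(open('urls.txt').read().splitlines())` into the loop above avoids even the ~1s Coda lookup for in-file repeats; note this only catches exact string matches, the same limitation as the CRM check.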

CLI Usage Guide

Learn batch processing patterns and automation

Coda Column Requirement

For duplicate detection to work, your Coda table must have a column named exactly:
Business URL
Column name is case-sensitive. “Business url” or “business_url” will NOT work.

Coda Table Setup

When creating your Coda table, ensure these columns exist:
Column Name         Type     Required
Business URL        Text     Yes
Business Name       Text     Yes
Business Type       Text     Yes
Primary Service     Text     Yes
Secondary Service   Text     No
Fit Score           Number   Yes
Reasoning           Text     Yes
Outreach Angle      Text     Yes

Coda Integration Guide

Complete setup instructions for Coda CRM

Best Practices

Train users to always use the same URL format:
  • ✅ Always include https://
  • ✅ Remove www. prefix
  • ✅ Remove trailing slashes
  • ✅ Remove query parameters
Or implement normalization in extractor.py before processing.
When importing historical leads, add them directly to Coda instead of processing through the engine. This populates the Business URL column for duplicate detection.
Validate URLs before analysis:
from urllib.parse import urlparse

def is_valid_url(url):
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:  # urlparse raises ValueError on some malformed inputs
        return False
Consider adding a “Last Analyzed” date column in Coda. Re-analyze leads after 6-12 months to detect changes in digital maturity.

Future Enhancements

Potential improvements to duplicate detection:
  1. Fuzzy Matching: Detect similar URLs with minor differences
  2. Domain-Level Deduplication: Treat blog.example.com and shop.example.com as same business
  3. Business Name Matching: If URL changes but business name matches, flag as potential duplicate
  4. Canonical URL Resolution: Follow redirects to find true destination before checking

Next Steps

Coda Integration

Complete guide to Coda setup and troubleshooting

CodaClient API

Programmatic usage of CRM functions

Architecture

How duplicate detection fits in the pipeline

CLI Usage

Batch processing with duplicate handling
