Overview
The VectorSearchService handles all vector database operations including storing embeddings, performing similarity searches, and managing document vectors. It supports both cosine similarity and Euclidean distance metrics.
Class Structure
Constructor
public function __construct(
    Database $db,
    $similarityMethod = 'cosine'
)
$db - Database connection instance
$similarityMethod - Similarity calculation method: 'cosine' or 'euclidean'
Instantiation Example
$vectorSearch = new VectorSearchService(
    $db,
    Config::get('rag.similarity_method')
);
Core Methods
searchSimilar()
Finds the most similar document chunks to a query embedding.
public function searchSimilar(
    array $queryEmbedding,
    $topK = 5,
    $threshold = 0.0,
    $maxCandidates = 200
)
$queryEmbedding - Vector embedding of the query (e.g., 1536-dimensional array from OpenAI)
$topK - Number of most similar results to return
$threshold - Minimum similarity score (0.0-1.0) for a result to be included
$maxCandidates - Maximum number of vectors to evaluate (random sample for performance)
Returns: Array of similar chunks sorted by score (descending)
Result Structure
[
    [
        'id' => 123,
        'document_id' => 45,
        'chunk_text' => 'Password reset instructions: 1. Click Forgot Password...',
        'chunk_index' => 2,
        'filename' => '20240315_user_guide.pdf',
        'original_name' => 'user_guide.pdf',
        'score' => 0.92
    ],
    [
        'id' => 456,
        'document_id' => 78,
        'chunk_text' => 'Account security: To change your password...',
        'chunk_index' => 0,
        'filename' => '20240320_faq.pdf',
        'original_name' => 'faq.pdf',
        'score' => 0.85
    ]
]
Implementation Details
Candidate Selection
Randomly samples vectors from active documents for performance:

$sql = "SELECT v.id, v.document_id, v.chunk_text, v.chunk_index,
               v.embedding, d.filename, d.original_name
        FROM vectors v
        INNER JOIN documents d ON v.document_id = d.id
        WHERE d.is_active = 1
        ORDER BY RAND()
        LIMIT {$maxCandidates}";

$vectors = $this->db->fetchAll($sql);
Random sampling provides good performance for medium-sized datasets. For large datasets (>10k vectors), consider implementing HNSW or FAISS indexing.
Similarity Calculation
Computes a similarity score for each candidate:

foreach ($vectors as $vector) {
    $storedEmbedding = VectorMath::unserializeVector($vector['embedding']);

    if ($this->similarityMethod === 'cosine') {
        $score = VectorMath::cosineSimilarity(
            $queryEmbedding,
            $storedEmbedding
        );
    } else {
        $distance = VectorMath::euclideanDistance(
            $queryEmbedding,
            $storedEmbedding
        );
        $score = 1 / (1 + $distance);
    }

    if ($score >= $threshold) {
        $results[] = [
            'id' => $vector['id'],
            'chunk_text' => $vector['chunk_text'],
            'score' => $score
            // ... other fields
        ];
    }
}
Sorting and Limiting
Sorts by score and returns the top K results:

usort($results, function ($a, $b) {
    return $b['score'] <=> $a['score'];
});

return array_slice($results, 0, $topK);
Usage Example
$queryEmbedding = $this->openai->createEmbedding($userMessage);

$similarChunks = $this->vectorSearch->searchSimilar(
    $queryEmbedding,
    $this->topK,      // 3 results
    $this->threshold  // 0.7 minimum score
);

if (!empty($similarChunks)) {
    $contextParts = [];
    foreach ($similarChunks as $chunk) {
        $contextParts[] = $chunk['chunk_text'];
    }
    $context = implode("\n\n", $contextParts);
}
storeVector()
Stores a document chunk embedding in the database.
public function storeVector(
    $documentId,
    $chunkText,
    $chunkIndex,
    array $embedding
)
$documentId - Database ID of the parent document
$chunkText - Text content of the chunk
$chunkIndex - Sequential index of the chunk within the document (0-based)
$embedding - Vector embedding array (e.g., 1536 floats)
Returns: Inserted row ID
Implementation
public function storeVector($documentId, $chunkText, $chunkIndex, array $embedding)
{
    $binaryEmbedding = VectorMath::serializeVector($embedding);

    return $this->db->insert('vectors', [
        'document_id' => $documentId,
        'chunk_text' => $chunkText,
        'chunk_index' => $chunkIndex,
        'embedding' => $binaryEmbedding
    ]);
}
Embeddings are serialized to a binary format using PHP's pack() function to save storage space:

public static function serializeVector(array $vector)
{
    return pack('f*', ...$vector);
}

A 1536-float array is stored as a 6,144-byte BLOB (1536 floats at 4 bytes each).
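The storage math can be checked directly in plain PHP. This is a standalone sanity check, not part of the service:

```php
<?php
// pack('f*') encodes each value as a 4-byte machine-order float,
// so a 1536-float vector packs to 1536 x 4 = 6,144 bytes.
$vector = array_fill(0, 1536, 0.5); // stand-in for a real embedding
$binary = pack('f*', ...$vector);
echo strlen($binary), "\n"; // 6144
```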
Usage Example
public function indexDocument($documentId, $text, $chunkSize = 500, $overlap = 50)
{
    $chunks = \App\Utils\TextProcessor::chunkText($text, $chunkSize, $overlap);
    $embeddings = $this->openai->createBatchEmbeddings($chunks);
    $indexed = 0;

    foreach ($chunks as $index => $chunk) {
        if ($embeddings[$index] !== null) {
            $this->vectorSearch->storeVector(
                $documentId,
                $chunk,
                $index,
                $embeddings[$index]
            );
            $indexed++;
        }
    }

    return $indexed;
}
deleteVectorsByDocument()
Removes all vectors associated with a document.
public function deleteVectorsByDocument($documentId)
$documentId - Database ID of the document
Returns: Number of rows deleted
Usage Example
// Before re-indexing a document
$vectorSearch->deleteVectorsByDocument($documentId);

// Then index the new chunks
$rag->indexDocument($documentId, $newText);
countVectors()
Returns the total number of stored vectors.
public function countVectors()
Returns: Integer count
$totalVectors = $vectorSearch->countVectors();
echo "Knowledge base contains {$totalVectors} indexed chunks\n";
Similarity Methods
Cosine Similarity (Default)
Measures the angle between vectors, ideal for text embeddings.
public static function cosineSimilarity(array $vec1, array $vec2)
{
    if (count($vec1) !== count($vec2)) {
        throw new \InvalidArgumentException('Vectors must have the same dimension');
    }

    $dotProduct = 0;
    $magnitude1 = 0;
    $magnitude2 = 0;

    for ($i = 0, $n = count($vec1); $i < $n; $i++) {
        $dotProduct += $vec1[$i] * $vec2[$i];
        $magnitude1 += $vec1[$i] * $vec1[$i];
        $magnitude2 += $vec2[$i] * $vec2[$i];
    }

    $magnitude1 = sqrt($magnitude1);
    $magnitude2 = sqrt($magnitude2);

    if ($magnitude1 == 0 || $magnitude2 == 0) {
        return 0;
    }

    return $dotProduct / ($magnitude1 * $magnitude2);
}
Range: -1.0 to 1.0 (typically 0.0 to 1.0 for embeddings)
Best for: Text embeddings (OpenAI, sentence transformers)
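For intuition, the two key properties can be demonstrated with a standalone sketch. The cosine() helper below is a local stand-in mirroring the formula above, not the VectorMath class itself:

```php
<?php
// Minimal cosine similarity: dot product over the product of magnitudes.
function cosine(array $a, array $b): float
{
    $dot = 0;
    $m1 = 0;
    $m2 = 0;
    foreach ($a as $i => $v) {
        $dot += $v * $b[$i];
        $m1 += $v * $v;
        $m2 += $b[$i] * $b[$i];
    }
    return ($m1 > 0 && $m2 > 0) ? $dot / (sqrt($m1) * sqrt($m2)) : 0.0;
}

echo cosine([1, 1], [2, 2]), "\n"; // 1 (same direction; magnitude ignored)
echo cosine([1, 0], [0, 1]), "\n"; // 0 (orthogonal vectors)
```

Note that [1, 1] and [2, 2] score a perfect 1.0 despite their different lengths: cosine similarity is scale-invariant.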
Euclidean Distance
Measures straight-line distance between vectors.
public static function euclideanDistance(array $vec1, array $vec2)
{
    if (count($vec1) !== count($vec2)) {
        throw new \InvalidArgumentException('Vectors must have the same dimension');
    }

    $sum = 0;
    for ($i = 0, $n = count($vec1); $i < $n; $i++) {
        $diff = $vec1[$i] - $vec2[$i];
        $sum += $diff * $diff;
    }

    return sqrt($sum);
}
Converted to similarity score:
$distance = VectorMath::euclideanDistance($queryEmbedding, $storedEmbedding);
$score = 1 / (1 + $distance);
Range: 0.0 to ∞ (converted to 0.0-1.0 similarity score)
Best for: Image embeddings, spatial data
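To make the 1 / (1 + distance) conversion concrete, here is a standalone sketch; the euclidean() helper is a local stand-in for VectorMath::euclideanDistance, not the library class:

```php
<?php
// Minimal Euclidean distance: square root of the sum of squared differences.
function euclidean(array $a, array $b): float
{
    $sum = 0;
    foreach ($a as $i => $v) {
        $sum += ($v - $b[$i]) ** 2;
    }
    return sqrt($sum);
}

$distance = euclidean([1, 1], [2, 2]); // sqrt(2), about 1.414
$score = 1 / (1 + $distance);          // about 0.414
// Cosine similarity of this same pair is 1.0: Euclidean distance penalizes
// the magnitude difference that cosine ignores.
```

A distance of 0 maps to a score of 1.0, and the score falls toward 0.0 as the distance grows, which lets Euclidean results share the same 0.0-1.0 threshold scale as cosine results.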
Comparison
Cosine Similarity
Advantages:
Scale-invariant (only cares about direction)
Standard for text embeddings
Better semantic understanding
Disadvantages:
Ignores magnitude differences
Use when:
Using OpenAI embeddings
Comparing text similarity
Magnitude is not important
Euclidean Distance
Advantages:
Considers both direction and magnitude
Intuitive geometric interpretation
Disadvantages:
Sensitive to vector scale
Not standard for text embeddings
Use when:
Comparing image embeddings
Magnitude matters
Working with normalized vectors
Recommendation: Use cosine similarity (default) for OpenAI text embeddings.
Candidate Sampling Strategy
The service uses random sampling to limit computation:
$maxCandidates = 200; // Evaluate at most 200 vectors
Small Dataset (<500 vectors)
Set a high maxCandidates to evaluate all vectors:

$similarChunks = $vectorSearch->searchSimilar(
    $queryEmbedding,
    3,    // topK
    0.7,  // threshold
    1000  // maxCandidates
);
Medium Dataset (500-5k vectors)
Use the default sampling (200-500 candidates).
Large Dataset (>5k vectors)
Keep sampling low and consider advanced indexing:

$maxCandidates = 200; // Fast but may miss some results
For production at scale, implement:
HNSW (Hierarchical Navigable Small World) index
FAISS (Facebook AI Similarity Search)
Pinecone or Weaviate vector database
Database Schema
CREATE TABLE vectors (
    id INT AUTO_INCREMENT PRIMARY KEY,
    document_id INT NOT NULL,
    chunk_text TEXT NOT NULL,
    chunk_index INT NOT NULL,
    embedding BLOB NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_document_id (document_id),
    INDEX idx_chunk_index (chunk_index),
    FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The embedding column stores binary-packed floats. For a 1536-dimension embedding:
Size: 1536 floats × 4 bytes = 6,144 bytes per vector
Storage type: BLOB (65,535 bytes max)
Query Optimization
-- Only search active documents
WHERE d.is_active = 1

-- Use the index on document_id for fast joins
INNER JOIN documents d ON v.document_id = d.id
Vector Serialization
The service uses VectorMath utility for efficient storage:
Serialize (Store)
Unserialize (Retrieve)
public static function serializeVector(array $vector)
{
    return pack('f*', ...$vector);
}
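The retrieval side is not shown in this section. Assuming the pack('f*') format above, the counterpart presumably looks like the local sketch below; only the name VectorMath::unserializeVector appears in the source, so the body here is an assumption:

```php
<?php
// Sketch of the likely unserialize counterpart: unpack('f*') returns a
// 1-indexed array, so array_values() restores the 0-based keys that the
// similarity functions expect.
function unserializeVector(string $binary): array
{
    return array_values(unpack('f*', $binary));
}

$original = [0.25, -0.5, 1.0]; // values exactly representable as 32-bit floats
$roundTrip = unserializeVector(pack('f*', ...$original));
// $roundTrip == [0.25, -0.5, 1.0]
```

Because pack('f*') stores 32-bit floats, values that need full 64-bit precision lose some digits on the round trip; the examples above use values that survive exactly.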
Storage Comparison
JSON storage: 1536 floats as a JSON string is roughly 10 KB
❌ Inefficient, slow parsing
Binary storage: the same 1536 floats pack to 6,144 bytes
✅ Compact, fast unpacking
Error Handling
try {
    $similarChunks = $vectorSearch->searchSimilar(
        $queryEmbedding,
        $topK,
        $threshold
    );

    if (empty($similarChunks)) {
        $logger->warning('No similar chunks found', [
            'threshold' => $threshold,
            'topK' => $topK
        ]);
        // Handle the no-results case
    }
} catch (\InvalidArgumentException $e) {
    $logger->error('Vector dimension mismatch: ' . $e->getMessage());
    // Vectors must have the same dimensions
} catch (\Exception $e) {
    $logger->error('Vector search error: ' . $e->getMessage());
    throw $e;
}
Configuration
Configure vector search in config/config.php:
return [
    'rag' => [
        'similarity_method' => 'cosine', // or 'euclidean'
        'similarity_threshold' => 0.7,
        'top_k' => 3,
        'max_candidates' => 200
    ]
];
Best Practices
Choose an Appropriate topK
Too low (1-2): may miss relevant context
Too high (10+): adds noise and increases token usage
Recommended: 3-5 for most use cases
Set Meaningful Thresholds
Test your threshold with sample queries:

$testQueries = [
    "How to reset password" => 0.85, // Expected high score
    "Weather today" => 0.15          // Expected low score
];

foreach ($testQueries as $query => $expectedMin) {
    $embedding = $openai->createEmbedding($query);
    $results = $vectorSearch->searchSimilar($embedding, 5, 0.0);
    echo "Query: {$query}\n";
    echo "Top score: {$results[0]['score']}\n";
    assert($results[0]['score'] >= $expectedMin);
}
Clean Up Orphaned Vectors
Periodically remove vectors from deleted documents:

// The schema's ON DELETE CASCADE handles this automatically; for manual cleanup:
$db->query(
    "DELETE v FROM vectors v
     LEFT JOIN documents d ON v.document_id = d.id
     WHERE d.id IS NULL"
);
Scaling Considerations
The current implementation uses in-memory similarity calculation, which works well up to ~10,000 vectors. Beyond that, consider:
Vector Database: Migrate to Pinecone, Weaviate, or Qdrant
HNSW Index: Implement approximate nearest neighbor search
Batch Processing: Pre-compute similarities for common queries
Caching Layer: Cache popular query results
Migration Path to Production Vector DB
// Example: Pinecone integration
class PineconeVectorSearchService implements VectorSearchInterface
{
    public function searchSimilar(array $queryEmbedding, $topK, $threshold)
    {
        $response = $this->pinecone->query([
            'vector' => $queryEmbedding,
            'topK' => $topK,
            'includeMetadata' => true
        ]);

        return array_filter($response['matches'], function ($match) use ($threshold) {
            return $match['score'] >= $threshold;
        });
    }
}
Related Services
RAG Service - uses vector search to retrieve relevant context
OpenAI Service - generates the embeddings stored by vector search
Next Steps
Index Documents - learn how to upload and index documents
Tune Search - optimize similarity thresholds and topK