Overview

The VectorSearchService handles all vector database operations including storing embeddings, performing similarity searches, and managing document vectors. It supports both cosine similarity and Euclidean distance metrics.

Class Structure

Constructor

public function __construct(
    Database $db,
    $similarityMethod = 'cosine'
)
Parameters:
  • db (Database, required): Database connection instance
  • similarityMethod (string, default 'cosine'): Similarity calculation method, either 'cosine' or 'euclidean'

Instantiation Example

webhook.php
$vectorSearch = new VectorSearchService(
    $db,
    Config::get('rag.similarity_method')
);

Core Methods

searchSimilar()

Finds the most similar document chunks to a query embedding.
public function searchSimilar(
    array $queryEmbedding,
    $topK = 5,
    $threshold = 0.0,
    $maxCandidates = 200
)
Parameters:
  • queryEmbedding (array, required): Vector embedding of the query (e.g., a 1536-dimensional array from OpenAI)
  • topK (int, default 5): Number of most similar results to return
  • threshold (float, default 0.0): Minimum similarity score (0.0-1.0) required to include a result
  • maxCandidates (int, default 200): Maximum number of vectors to evaluate (random sample for performance)
Returns: Array of similar chunks sorted by score (descending)

Result Structure

[
    [
        'id' => 123,
        'document_id' => 45,
        'chunk_text' => 'Password reset instructions: 1. Click Forgot Password...',
        'chunk_index' => 2,
        'filename' => '20240315_user_guide.pdf',
        'original_name' => 'user_guide.pdf',
        'score' => 0.92
    ],
    [
        'id' => 456,
        'document_id' => 78,
        'chunk_text' => 'Account security: To change your password...',
        'chunk_index' => 0,
        'filename' => '20240320_faq.pdf',
        'original_name' => 'faq.pdf',
        'score' => 0.85
    ]
]

Implementation Details

Step 1: Candidate Selection

Randomly samples vectors from active documents for performance:
$sql = "SELECT v.id, v.document_id, v.chunk_text, v.chunk_index, 
               v.embedding, d.filename, d.original_name 
        FROM vectors v 
        INNER JOIN documents d ON v.document_id = d.id 
        WHERE d.is_active = 1
        ORDER BY RAND()
        LIMIT {$maxCandidates}";

$vectors = $this->db->fetchAll($sql);
Random sampling provides good performance for medium-sized datasets. For large datasets (>10k vectors), consider implementing HNSW or FAISS indexing.
Step 2: Similarity Calculation

Computes similarity score for each candidate:
foreach ($vectors as $vector) {
    $storedEmbedding = VectorMath::unserializeVector($vector['embedding']);
    
    if ($this->similarityMethod === 'cosine') {
        $score = VectorMath::cosineSimilarity(
            $queryEmbedding,
            $storedEmbedding
        );
    } else {
        $distance = VectorMath::euclideanDistance(
            $queryEmbedding,
            $storedEmbedding
        );
        $score = 1 / (1 + $distance);
    }
    
    if ($score >= $threshold) {
        $results[] = [
            'id' => $vector['id'],
            'chunk_text' => $vector['chunk_text'],
            'score' => $score
            // ... other fields
        ];
    }
}
Step 3: Sorting and Limiting

Sorts by score and returns top K results:
usort($results, function($a, $b) {
    return $b['score'] <=> $a['score'];
});

return array_slice($results, 0, $topK);

Usage Example

RAGService.php
$queryEmbedding = $this->openai->createEmbedding($userMessage);

$similarChunks = $this->vectorSearch->searchSimilar(
    $queryEmbedding,
    $this->topK,      // 3 results
    $this->threshold  // 0.7 minimum score
);

if (!empty($similarChunks)) {
    $contextParts = [];
    foreach ($similarChunks as $chunk) {
        $contextParts[] = $chunk['chunk_text'];
    }
    $context = implode("\n\n", $contextParts);
}
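The assembled $context is typically injected into the system prompt before the chat completion call. A minimal sketch of that next step; the chatCompletion() method name and the prompt wording here are illustrative assumptions, not part of the documented API:

```php
// Hypothetical continuation of RAGService: ground the model in retrieved chunks.
$systemPrompt = "Answer using only the context below. "
    . "If the context is insufficient, say so.\n\n"
    . "Context:\n" . $context;

// chatCompletion() is an assumed helper on the OpenAI client wrapper,
// analogous to createEmbedding() above; adapt to your own client.
$answer = $this->openai->chatCompletion([
    ['role' => 'system', 'content' => $systemPrompt],
    ['role' => 'user',   'content' => $userMessage],
]);
```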

storeVector()

Stores a document chunk embedding in the database.
public function storeVector(
    $documentId,
    $chunkText,
    $chunkIndex,
    array $embedding
)
Parameters:
  • documentId (int, required): Database ID of the parent document
  • chunkText (string, required): Text content of the chunk
  • chunkIndex (int, required): Sequential index of the chunk within the document (0-based)
  • embedding (array, required): Vector embedding array (e.g., 1536 floats)
Returns: Inserted row ID

Implementation

public function storeVector($documentId, $chunkText, $chunkIndex, array $embedding)
{
    $binaryEmbedding = VectorMath::serializeVector($embedding);
    
    return $this->db->insert('vectors', [
        'document_id' => $documentId,
        'chunk_text' => $chunkText,
        'chunk_index' => $chunkIndex,
        'embedding' => $binaryEmbedding
    ]);
}
Embeddings are serialized to binary format using PHP’s pack() function to save storage space:
VectorMath.php
public static function serializeVector(array $vector)
{
    return pack('f*', ...$vector);
}
A 1536-float array packs to 6,144 bytes (4 bytes per float) and is stored as a BLOB.
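The counterpart unserializeVector() (used in the similarity loop above) can be built with unpack(). This is a sketch inferred from the pack('f*') format; note that PHP's unpack() returns a 1-indexed array, so the result must be re-indexed before it reaches the math helpers:

```php
public static function unserializeVector($binary)
{
    // unpack('f*', ...) returns a 1-indexed array, so re-index from 0
    // Note: 'f' is a 32-bit format, so values round-trip with float32 precision
    return array_values(unpack('f*', $binary));
}
```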

Usage Example

RAGService.php
public function indexDocument($documentId, $text, $chunkSize = 500, $overlap = 50)
{
    $chunks = \App\Utils\TextProcessor::chunkText($text, $chunkSize, $overlap);
    
    $embeddings = $this->openai->createBatchEmbeddings($chunks);

    $indexed = 0;
    foreach ($chunks as $index => $chunk) {
        if ($embeddings[$index] !== null) {
            $this->vectorSearch->storeVector(
                $documentId,
                $chunk,
                $index,
                $embeddings[$index]
            );
            $indexed++;
        }
    }
    
    return $indexed;
}

deleteVectorsByDocument()

Removes all vectors associated with a document.
public function deleteVectorsByDocument($documentId)
Parameters:
  • documentId (int, required): Database ID of the document
Returns: Number of rows deleted

Usage Example

// Before re-indexing a document
$vectorSearch->deleteVectorsByDocument($documentId);

// Then index new chunks
$rag->indexDocument($documentId, $newText);

countVectors()

Returns the total number of stored vectors.
public function countVectors()
Returns: Integer count
$totalVectors = $vectorSearch->countVectors();
echo "Knowledge base contains {$totalVectors} indexed chunks\n";

Similarity Methods

Cosine Similarity (Default)

Measures the angle between vectors, ideal for text embeddings.
VectorMath.php
public static function cosineSimilarity(array $vec1, array $vec2)
{
    if (count($vec1) !== count($vec2)) {
        throw new \InvalidArgumentException('Vectors must have the same dimension');
    }

    $dotProduct = 0;
    $magnitude1 = 0;
    $magnitude2 = 0;

    for ($i = 0; $i < count($vec1); $i++) {
        $dotProduct += $vec1[$i] * $vec2[$i];
        $magnitude1 += $vec1[$i] * $vec1[$i];
        $magnitude2 += $vec2[$i] * $vec2[$i];
    }

    $magnitude1 = sqrt($magnitude1);
    $magnitude2 = sqrt($magnitude2);

    if ($magnitude1 == 0 || $magnitude2 == 0) {
        return 0;
    }

    return $dotProduct / ($magnitude1 * $magnitude2);
}
Range: -1.0 to 1.0 (typically 0.0 to 1.0 for embeddings)
Best for: Text embeddings (OpenAI, sentence transformers)

Euclidean Distance

Measures straight-line distance between vectors.
VectorMath.php
public static function euclideanDistance(array $vec1, array $vec2)
{
    if (count($vec1) !== count($vec2)) {
        throw new \InvalidArgumentException('Vectors must have the same dimension');
    }

    $sum = 0;
    for ($i = 0; $i < count($vec1); $i++) {
        $diff = $vec1[$i] - $vec2[$i];
        $sum += $diff * $diff;
    }

    return sqrt($sum);
}
Converted to similarity score:
$distance = VectorMath::euclideanDistance($queryEmbedding, $storedEmbedding);
$score = 1 / (1 + $distance);
Range: 0.0 to ∞ (converted to a 0.0-1.0 similarity score)
Best for: Image embeddings, spatial data

Comparison

Cosine similarity advantages:
  • Scale-invariant (only direction matters)
  • Standard metric for text embeddings
  • Better semantic matching
Cosine similarity disadvantages:
  • Ignores magnitude differences
Use cosine similarity when:
  • Using OpenAI embeddings
  • Comparing text similarity
  • Magnitude is not important
Recommendation: Use cosine similarity (the default) for OpenAI text embeddings.
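The scale-invariance point shows up in a small worked example: [1, 0] and [3, 0] point in the same direction, so cosine similarity is 1.0, while Euclidean distance is 2.0, giving a converted score of only 1 / (1 + 2) ≈ 0.33. The metrics are inlined here so the snippet is self-contained; VectorMath::cosineSimilarity() and euclideanDistance() in this codebase compute the same quantities:

```php
$a = [1.0, 0.0];
$b = [3.0, 0.0];

// Cosine: dot product over the product of magnitudes
$dot = $a[0] * $b[0] + $a[1] * $b[1];                           // 3.0
$cosine = $dot / (sqrt(1.0) * sqrt(9.0));                       // 1.0 (direction only)

// Euclidean: straight-line distance, then converted to a score
$distance = sqrt(($a[0] - $b[0]) ** 2 + ($a[1] - $b[1]) ** 2);  // 2.0
$score = 1 / (1 + $distance);                                   // ~0.33 (magnitude matters)
```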

Performance Optimization

Candidate Sampling Strategy

The service uses random sampling to limit computation:
$maxCandidates = 200;  // Evaluate at most 200 vectors
For small datasets, set maxCandidates at or above the total vector count so every vector is evaluated:
$similarChunks = $vectorSearch->searchSimilar(
    $queryEmbedding,
    $topK = 3,
    $threshold = 0.7,
    $maxCandidates = 1000
);
For medium datasets, use the default sampling range (200-500 candidates):
$maxCandidates = 500;
For large datasets, keep the sample small and consider advanced indexing:
$maxCandidates = 200;  // Fast but may miss some results
For production at scale, implement:
  • HNSW (Hierarchical Navigable Small World) index
  • FAISS (Facebook AI Similarity Search)
  • Pinecone or Weaviate vector database

Database Schema

CREATE TABLE vectors (
    id INT AUTO_INCREMENT PRIMARY KEY,
    document_id INT NOT NULL,
    chunk_text TEXT NOT NULL,
    chunk_index INT NOT NULL,
    embedding BLOB NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_document_id (document_id),
    INDEX idx_chunk_index (chunk_index),
    FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The embedding column stores binary-packed floats. For a 1536-dimension embedding:
  • Size: 1536 floats × 4 bytes = 6,144 bytes per vector
  • Storage type: BLOB (65,535 bytes max)

Query Optimization

// Only search active documents
WHERE d.is_active = 1

// Use index on document_id for fast joins
INNER JOIN documents d ON v.document_id = d.id

Vector Serialization

The service uses VectorMath utility for efficient storage:
public static function serializeVector(array $vector)
{
    return pack('f*', ...$vector);
}

Storage Comparison

JSON Storage

1536 floats as JSON: ~10KB
❌ Inefficient, slow parsing

Binary Storage

1536 floats packed: ~6KB
✅ Compact, fast unpacking
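The size difference is easy to verify: packing 1536 floats always yields exactly 1536 × 4 = 6,144 bytes, while the JSON length varies with the printed precision of each value. The vector below is a hypothetical stand-in for a real embedding:

```php
// Hypothetical 1536-dimensional vector, standing in for a real embedding
$vector = array_map(fn ($i) => sin($i / 10), range(0, 1535));

$binary = pack('f*', ...$vector);   // always 1536 x 4 bytes
$json   = json_encode($vector);     // length depends on printed precision

echo strlen($binary) . "\n";  // 6144
echo strlen($json) . "\n";    // considerably larger than the packed form
```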

Error Handling

try {
    $similarChunks = $vectorSearch->searchSimilar(
        $queryEmbedding,
        $topK,
        $threshold
    );
    
    if (empty($similarChunks)) {
        $logger->warning('No similar chunks found', [
            'threshold' => $threshold,
            'topK' => $topK
        ]);
        // Handle no results case
    }
} catch (\InvalidArgumentException $e) {
    $logger->error('Vector dimension mismatch: ' . $e->getMessage());
    // Vectors must have same dimensions
} catch (\Exception $e) {
    $logger->error('Vector search error: ' . $e->getMessage());
    throw $e;
}

Configuration

Configure vector search in config/config.php:
config/config.php
return [
    'rag' => [
        'similarity_method' => 'cosine',  // or 'euclidean'
        'similarity_threshold' => 0.7,
        'top_k' => 3,
        'max_candidates' => 200
    ]
];

Best Practices

Choose topK carefully:
  • Too low (1-2): May miss relevant context
  • Too high (10+): Adds noise and increases token usage
  • Recommended: 3-5 for most use cases
Test your threshold with sample queries:
$testQueries = [
    "How to reset password" => 0.85,  // Expected high score
    "Weather today" => 0.15          // Expected low score
];

foreach ($testQueries as $query => $expectedMin) {
    $embedding = $openai->createEmbedding($query);
    $results = $vectorSearch->searchSimilar($embedding, 5, 0.0);

    // Guard against an empty result set before reading index 0
    $topScore = $results[0]['score'] ?? 0.0;
    echo "Query: {$query}\n";
    echo "Top score: {$topScore}\n";
    assert($topScore >= $expectedMin);
}
Log search performance metrics:
$startTime = microtime(true);
$results = $vectorSearch->searchSimilar($queryEmbedding, 3, 0.7, 200);
$duration = microtime(true) - $startTime;

$logger->info('Vector search completed', [
    'duration_ms' => round($duration * 1000, 2),
    'results_count' => count($results),
    'max_score' => $results[0]['score'] ?? 0
]);
Periodically remove vectors from deleted documents:
// The schema's ON DELETE CASCADE handles this automatically; for manual cleanup of orphans:
$db->query(
    "DELETE v FROM vectors v 
     LEFT JOIN documents d ON v.document_id = d.id 
     WHERE d.id IS NULL"
);

Scaling Considerations

The current implementation uses in-memory similarity calculation, which works well up to ~10,000 vectors. Beyond that, consider:
  1. Vector Database: Migrate to Pinecone, Weaviate, or Qdrant
  2. HNSW Index: Implement approximate nearest neighbor search
  3. Batch Processing: Pre-compute similarities for common queries
  4. Caching Layer: Cache popular query results
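A shared interface makes such a migration incremental: the PineconeVectorSearchService example below implements a VectorSearchInterface, so the rest of the application can swap backends without code changes. A minimal version of that contract (an assumption; the interface is not shown elsewhere in this documentation) might look like:

```php
interface VectorSearchInterface
{
    /** Returns chunks sorted by score (descending). */
    public function searchSimilar(array $queryEmbedding, $topK = 5, $threshold = 0.0);

    /** Returns the inserted row ID. */
    public function storeVector($documentId, $chunkText, $chunkIndex, array $embedding);

    /** Returns the number of rows deleted. */
    public function deleteVectorsByDocument($documentId);
}
```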

Migration Path to Production Vector DB

// Example: Pinecone integration
class PineconeVectorSearchService implements VectorSearchInterface
{
    public function searchSimilar(array $queryEmbedding, $topK, $threshold)
    {
        $response = $this->pinecone->query([
            'vector' => $queryEmbedding,
            'topK' => $topK,
            'includeMetadata' => true
        ]);
        
        return array_filter($response['matches'], function($match) use ($threshold) {
            return $match['score'] >= $threshold;
        });
    }
}

Related Services

  • RAG Service: Uses vector search to retrieve relevant context
  • OpenAI Service: Generates embeddings stored by vector search

Next Steps

  • Index Documents: Learn how to upload and index documents
  • Tune Search: Optimize similarity thresholds and topK