Overview
The VectorSearchService handles all vector database operations including storing embeddings, performing similarity searches, and managing document vectors. It supports both cosine similarity and Euclidean distance metrics.
Class Structure
Constructor
public function __construct(
    Database $db,
    $similarityMethod = 'cosine'
)
$db - Database connection instance
$similarityMethod - Similarity calculation method: 'cosine' or 'euclidean'
Instantiation Example
$vectorSearch = new VectorSearchService(
    $db,
    Config::get('rag.similarity_method')
);
Core Methods
searchSimilar()
Finds the most similar document chunks to a query embedding.
public function searchSimilar(
    array $queryEmbedding,
    $topK = 5,
    $threshold = 0.0,
    $maxCandidates = 200
)
$queryEmbedding - Vector embedding of the query (e.g., 1536-dimensional array from OpenAI)
$topK - Number of most similar results to return
$threshold - Minimum similarity score (0.0-1.0) for a result to be included
$maxCandidates - Maximum number of vectors to evaluate (random sample for performance)
Returns: Array of similar chunks sorted by score (descending)
Result Structure
[
    [
        'id' => 123,
        'document_id' => 45,
        'chunk_text' => 'Password reset instructions: 1. Click Forgot Password...',
        'chunk_index' => 2,
        'filename' => '20240315_user_guide.pdf',
        'original_name' => 'user_guide.pdf',
        'score' => 0.92
    ],
    [
        'id' => 456,
        'document_id' => 78,
        'chunk_text' => 'Account security: To change your password...',
        'chunk_index' => 0,
        'filename' => '20240320_faq.pdf',
        'original_name' => 'faq.pdf',
        'score' => 0.85
    ]
]
Implementation Details
Candidate Selection
Randomly samples vectors from active documents for performance:

$sql = "SELECT v.id, v.document_id, v.chunk_text, v.chunk_index,
               v.embedding, d.filename, d.original_name
        FROM vectors v
        INNER JOIN documents d ON v.document_id = d.id
        WHERE d.is_active = 1
        ORDER BY RAND()
        LIMIT {$maxCandidates}";

$vectors = $this->db->fetchAll($sql);
Random sampling provides good performance for medium-sized datasets. For large datasets (>10k vectors), consider implementing HNSW or FAISS indexing.
Similarity Calculation
Computes a similarity score for each candidate:

foreach ($vectors as $vector) {
    $storedEmbedding = VectorMath::unserializeVector($vector['embedding']);

    if ($this->similarityMethod === 'cosine') {
        $score = VectorMath::cosineSimilarity(
            $queryEmbedding,
            $storedEmbedding
        );
    } else {
        $distance = VectorMath::euclideanDistance(
            $queryEmbedding,
            $storedEmbedding
        );
        $score = 1 / (1 + $distance);
    }

    if ($score >= $threshold) {
        $results[] = [
            'id' => $vector['id'],
            'chunk_text' => $vector['chunk_text'],
            'score' => $score
            // ... other fields
        ];
    }
}
Sorting and Limiting
Sorts by score and returns the top K results:

usort($results, function ($a, $b) {
    return $b['score'] <=> $a['score'];
});

return array_slice($results, 0, $topK);
Usage Example
$queryEmbedding = $this->openai->createEmbedding($userMessage);

$similarChunks = $this->vectorSearch->searchSimilar(
    $queryEmbedding,
    $this->topK,      // 3 results
    $this->threshold  // 0.7 minimum score
);

if (!empty($similarChunks)) {
    $contextParts = [];
    foreach ($similarChunks as $chunk) {
        $contextParts[] = $chunk['chunk_text'];
    }
    $context = implode("\n\n", $contextParts);
}
storeVector()
Stores a document chunk embedding in the database.
public function storeVector(
    $documentId,
    $chunkText,
    $chunkIndex,
    array $embedding
)
$documentId - Database ID of the parent document
$chunkText - Text content of the chunk
$chunkIndex - Sequential index of the chunk within the document (0-based)
$embedding - Vector embedding array (e.g., 1536 floats)
Returns: Inserted row ID
Implementation
public function storeVector($documentId, $chunkText, $chunkIndex, array $embedding)
{
    $binaryEmbedding = VectorMath::serializeVector($embedding);

    return $this->db->insert('vectors', [
        'document_id' => $documentId,
        'chunk_text' => $chunkText,
        'chunk_index' => $chunkIndex,
        'embedding' => $binaryEmbedding
    ]);
}
Embeddings are serialized to a binary format using PHP's pack() function to save storage space:

public static function serializeVector(array $vector)
{
    return pack('f*', ...$vector);
}

A 1536-float array is stored as a 6,144-byte BLOB (1536 floats at 4 bytes each).
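The storage math can be checked directly in plain PHP. This is a standalone sanity check, not part of the service:

```php
<?php
// pack('f*') encodes each value as a 4-byte machine-order float,
// so a 1536-float vector packs to 1536 x 4 = 6,144 bytes.
$vector = array_fill(0, 1536, 0.5); // stand-in for a real embedding
$binary = pack('f*', ...$vector);
echo strlen($binary), "\n"; // 6144
```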
Usage Example
public function indexDocument($documentId, $text, $chunkSize = 500, $overlap = 50)
{
    $chunks = \App\Utils\TextProcessor::chunkText($text, $chunkSize, $overlap);
    $embeddings = $this->openai->createBatchEmbeddings($chunks);
    $indexed = 0;

    foreach ($chunks as $index => $chunk) {
        if ($embeddings[$index] !== null) {
            $this->vectorSearch->storeVector(
                $documentId,
                $chunk,
                $index,
                $embeddings[$index]
            );
            $indexed++;
        }
    }

    return $indexed;
}
deleteVectorsByDocument()
Removes all vectors associated with a document.
public function deleteVectorsByDocument($documentId)
$documentId - Database ID of the document
Returns: Number of rows deleted
Usage Example
// Before re-indexing a document
$vectorSearch->deleteVectorsByDocument($documentId);

// Then index the new chunks
$rag->indexDocument($documentId, $newText);
countVectors()
Returns the total number of stored vectors.
public function countVectors()
Returns: Integer count
$totalVectors = $vectorSearch->countVectors();
echo "Knowledge base contains {$totalVectors} indexed chunks\n";
Similarity Methods
Cosine Similarity (Default)
Measures the angle between vectors, ideal for text embeddings.
public static function cosineSimilarity(array $vec1, array $vec2)
{
    if (count($vec1) !== count($vec2)) {
        throw new \InvalidArgumentException('Vectors must have the same dimension');
    }

    $dotProduct = 0;
    $magnitude1 = 0;
    $magnitude2 = 0;

    for ($i = 0, $n = count($vec1); $i < $n; $i++) {
        $dotProduct += $vec1[$i] * $vec2[$i];
        $magnitude1 += $vec1[$i] * $vec1[$i];
        $magnitude2 += $vec2[$i] * $vec2[$i];
    }

    $magnitude1 = sqrt($magnitude1);
    $magnitude2 = sqrt($magnitude2);

    if ($magnitude1 == 0 || $magnitude2 == 0) {
        return 0;
    }

    return $dotProduct / ($magnitude1 * $magnitude2);
}
Range: -1.0 to 1.0 (typically 0.0 to 1.0 for embeddings)
Best for: Text embeddings (OpenAI, sentence transformers)
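For intuition, the two key properties can be demonstrated with a standalone sketch. The cosine() helper below is a local stand-in mirroring the formula above, not the VectorMath class itself:

```php
<?php
// Minimal cosine similarity: dot product over the product of magnitudes.
function cosine(array $a, array $b): float
{
    $dot = 0;
    $m1 = 0;
    $m2 = 0;
    foreach ($a as $i => $v) {
        $dot += $v * $b[$i];
        $m1 += $v * $v;
        $m2 += $b[$i] * $b[$i];
    }
    return ($m1 > 0 && $m2 > 0) ? $dot / (sqrt($m1) * sqrt($m2)) : 0.0;
}

echo cosine([1, 1], [2, 2]), "\n"; // 1 (same direction; magnitude ignored)
echo cosine([1, 0], [0, 1]), "\n"; // 0 (orthogonal vectors)
```

Note that [1, 1] and [2, 2] score a perfect 1.0 despite their different lengths: cosine similarity is scale-invariant.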
Euclidean Distance
Measures straight-line distance between vectors.
public static function euclideanDistance(array $vec1, array $vec2)
{
    if (count($vec1) !== count($vec2)) {
        throw new \InvalidArgumentException('Vectors must have the same dimension');
    }

    $sum = 0;
    for ($i = 0, $n = count($vec1); $i < $n; $i++) {
        $diff = $vec1[$i] - $vec2[$i];
        $sum += $diff * $diff;
    }

    return sqrt($sum);
}
Converted to similarity score:
$distance = VectorMath::euclideanDistance($queryEmbedding, $storedEmbedding);
$score = 1 / (1 + $distance);
Range: 0.0 to ∞ (converted to 0.0-1.0 similarity score)
Best for: Image embeddings, spatial data
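To make the 1 / (1 + distance) conversion concrete, here is a standalone sketch; the euclidean() helper is a local stand-in for VectorMath::euclideanDistance, not the library class:

```php
<?php
// Minimal Euclidean distance: square root of the sum of squared differences.
function euclidean(array $a, array $b): float
{
    $sum = 0;
    foreach ($a as $i => $v) {
        $sum += ($v - $b[$i]) ** 2;
    }
    return sqrt($sum);
}

$distance = euclidean([1, 1], [2, 2]); // sqrt(2), about 1.414
$score = 1 / (1 + $distance);          // about 0.414
// Cosine similarity of this same pair is 1.0: Euclidean distance penalizes
// the magnitude difference that cosine ignores.
```

A distance of 0 maps to a score of 1.0, and the score falls toward 0.0 as the distance grows, which lets Euclidean results share the same 0.0-1.0 threshold scale as cosine results.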
Comparison
Cosine Similarity
Advantages:
Scale-invariant (only cares about direction)
Standard for text embeddings
Better semantic understanding
Disadvantages:
Ignores magnitude differences
Use when:
Using OpenAI embeddings
Comparing text similarity
Magnitude is not important
Euclidean Distance
Advantages:
Considers both direction and magnitude
Intuitive geometric interpretation
Disadvantages:
Sensitive to vector scale
Not standard for text embeddings
Use when:
Comparing image embeddings
Magnitude matters
Working with normalized vectors
Recommendation: Use cosine similarity (default) for OpenAI text embeddings.
Candidate Sampling Strategy
The service uses random sampling to limit computation:
$maxCandidates = 200; // Evaluate at most 200 vectors
Small Dataset (<500 vectors)
Set a high maxCandidates to evaluate all vectors:

$similarChunks = $vectorSearch->searchSimilar(
    $queryEmbedding,
    3,    // topK
    0.7,  // threshold
    1000  // maxCandidates
);
Medium Dataset (500-5k vectors)
Use the default sampling (200-500 candidates).
Large Dataset (>5k vectors)
Keep sampling low and consider advanced indexing:

$maxCandidates = 200; // Fast but may miss some results
For production at scale, implement:
HNSW (Hierarchical Navigable Small World) index
FAISS (Facebook AI Similarity Search)
Pinecone or Weaviate vector database
Database Schema
CREATE TABLE vectors (
    id INT AUTO_INCREMENT PRIMARY KEY,
    document_id INT NOT NULL,
    chunk_text TEXT NOT NULL,
    chunk_index INT NOT NULL,
    embedding BLOB NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_document_id (document_id),
    INDEX idx_chunk_index (chunk_index),
    FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The embedding column stores binary-packed floats. For a 1536-dimension embedding:
Size: 1536 floats × 4 bytes = 6,144 bytes per vector
Storage type: BLOB (65,535 bytes max)
Query Optimization
-- Only search active documents
WHERE d.is_active = 1

-- Use the index on document_id for fast joins
INNER JOIN documents d ON v.document_id = d.id
Vector Serialization
The service uses VectorMath utility for efficient storage:
Serialize (Store)
Unserialize (Retrieve)
public static function serializeVector(array $vector)
{
    return pack('f*', ...$vector);
}
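The retrieval side is not shown in this section. Assuming the pack('f*') format above, the counterpart presumably looks like the local sketch below; only the name VectorMath::unserializeVector appears in the source, so the body here is an assumption:

```php
<?php
// Sketch of the likely unserialize counterpart: unpack('f*') returns a
// 1-indexed array, so array_values() restores the 0-based keys that the
// similarity functions expect.
function unserializeVector(string $binary): array
{
    return array_values(unpack('f*', $binary));
}

$original = [0.25, -0.5, 1.0]; // values exactly representable as 32-bit floats
$roundTrip = unserializeVector(pack('f*', ...$original));
// $roundTrip == [0.25, -0.5, 1.0]
```

Because pack('f*') stores 32-bit floats, values that need full 64-bit precision lose some digits on the round trip; the examples above use values that survive exactly.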
Storage Comparison
JSON storage: 1536 floats as a JSON string is roughly 10 KB
❌ Inefficient, slow parsing
Binary storage: the same 1536 floats pack to 6,144 bytes
✅ Compact, fast unpacking
Error Handling
try {
    $similarChunks = $vectorSearch->searchSimilar(
        $queryEmbedding,
        $topK,
        $threshold
    );

    if (empty($similarChunks)) {
        $logger->warning('No similar chunks found', [
            'threshold' => $threshold,
            'topK' => $topK
        ]);
        // Handle the no-results case
    }
} catch (\InvalidArgumentException $e) {
    $logger->error('Vector dimension mismatch: ' . $e->getMessage());
    // Vectors must have the same dimensions
} catch (\Exception $e) {
    $logger->error('Vector search error: ' . $e->getMessage());
    throw $e;
}
Configuration
Configure vector search in config/config.php:
return [
    'rag' => [
        'similarity_method' => 'cosine', // or 'euclidean'
        'similarity_threshold' => 0.7,
        'top_k' => 3,
        'max_candidates' => 200
    ]
];
Best Practices
Choose an Appropriate topK
Too low (1-2): may miss relevant context
Too high (10+): adds noise and increases token usage
Recommended: 3-5 for most use cases
Set Meaningful Thresholds
Test your threshold with sample queries:

$testQueries = [
    "How to reset password" => 0.85, // Expected high score
    "Weather today" => 0.15          // Expected low score
];

foreach ($testQueries as $query => $expectedMin) {
    $embedding = $openai->createEmbedding($query);
    $results = $vectorSearch->searchSimilar($embedding, 5, 0.0);
    echo "Query: {$query}\n";
    echo "Top score: {$results[0]['score']}\n";
    assert($results[0]['score'] >= $expectedMin);
}
Clean Up Orphaned Vectors
Periodically remove vectors from deleted documents:

// The schema's ON DELETE CASCADE handles this automatically; for manual cleanup:
$db->query(
    "DELETE v FROM vectors v
     LEFT JOIN documents d ON v.document_id = d.id
     WHERE d.id IS NULL"
);
Scaling Considerations
The current implementation uses in-memory similarity calculation, which works well up to ~10,000 vectors. Beyond that, consider:
Vector Database: Migrate to Pinecone, Weaviate, or Qdrant
HNSW Index: Implement approximate nearest neighbor search
Batch Processing: Pre-compute similarities for common queries
Caching Layer: Cache popular query results
Migration Path to Production Vector DB
// Example: Pinecone integration
class PineconeVectorSearchService implements VectorSearchInterface
{
    public function searchSimilar(array $queryEmbedding, $topK, $threshold)
    {
        $response = $this->pinecone->query([
            'vector' => $queryEmbedding,
            'topK' => $topK,
            'includeMetadata' => true
        ]);

        return array_filter($response['matches'], function ($match) use ($threshold) {
            return $match['score'] >= $threshold;
        });
    }
}
Related Services
RAG Service - uses vector search to retrieve relevant context
OpenAI Service - generates the embeddings stored by vector search
Next Steps
Index Documents - learn how to upload and index documents
Tune Search - optimize similarity thresholds and topK