Collections Guide
Collections in VecLabs are analogous to Pinecone indexes. Each collection stores vectors of a fixed dimension and uses a single distance metric for similarity search.
What is a Collection?
A collection is an isolated namespace for storing and querying vectors:
Fixed dimension: All vectors must have the same dimension
Single distance metric: Cosine, Euclidean, or Dot Product
Unique namespace: Collections are isolated from each other
On-chain verification: Each collection has a Merkle root stored on Solana
Think of collections as separate tables in a database. Each collection is optimized for a specific embedding model and use case.
Creating Collections
```typescript
import { SolVec } from 'solvec';

const sv = new SolVec({ network: 'devnet' });

const col = sv.collection('my-collection', {
  dimensions: 1536,
  metric: 'cosine'
});
```
```python
from solvec import SolVec

sv = SolVec(network="devnet")

col = sv.collection(
    "my-collection",
    dimensions=1536,
    metric="cosine"
)
```
Collections are created lazily: a collection is initialized on the first upsert(), not when collection() is called.
Dimension Configuration
Common Embedding Dimensions
| Model | Dimensions | Notes |
| --- | --- | --- |
| OpenAI text-embedding-3-small | 1536 | Fast, cost-effective, high quality |
| OpenAI text-embedding-3-large | 3072 | Highest accuracy for critical tasks |
| Cohere embed-english-v3.0 | 1024 | Optimized for semantic search |
| Google Gemini embedding-001 | 768 | Fast, free tier available |
Setting Dimensions
```typescript
// OpenAI text-embedding-3-small
const col1 = sv.collection('openai-small', { dimensions: 1536 });

// OpenAI text-embedding-3-large
const col2 = sv.collection('openai-large', { dimensions: 3072 });

// Cohere embed-english-v3.0
const col3 = sv.collection('cohere', { dimensions: 1024 });

// Custom dimension
const col4 = sv.collection('custom', { dimensions: 384 });
```
```python
# OpenAI text-embedding-3-small
col1 = sv.collection("openai-small", dimensions=1536)

# OpenAI text-embedding-3-large
col2 = sv.collection("openai-large", dimensions=3072)

# Cohere embed-english-v3.0
col3 = sv.collection("cohere", dimensions=1024)

# Custom dimension
col4 = sv.collection("custom", dimensions=384)
```
Default dimension is 1536 (OpenAI text-embedding-3-small). Only specify dimensions if you’re using a different model.
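Because the dimension is fixed at collection creation, a cheap client-side check catches mismatched embeddings before an upsert fails. The helper below is a hypothetical sketch, not part of the SolVec SDK:

```typescript
// Hypothetical client-side guard (not an SDK API): verify every vector's
// length matches the collection's configured dimension before upserting.
function assertDimension(
  vectors: { id: string; values: number[] }[],
  dimensions: number
): void {
  for (const v of vectors) {
    if (v.values.length !== dimensions) {
      throw new Error(
        `vector ${v.id} has dimension ${v.values.length}, expected ${dimensions}`
      );
    }
  }
}
```

Running this before calling upsert turns a server-side rejection into an immediate, descriptive local error.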
Distance Metrics
VecLabs supports three distance metrics:
1. Cosine Similarity (Default)
Best for: Most use cases, especially semantic search
How it works: Measures the angle between vectors, ignoring magnitude
Range: 0.0 to 1.0 (1.0 = identical direction)
```typescript
const col = sv.collection('semantic-search', {
  dimensions: 1536,
  metric: 'cosine'
});
```
```python
col = sv.collection(
    "semantic-search",
    dimensions=1536,
    metric="cosine"
)
```
Use cases:
Document similarity
Semantic search
Recommendation systems
Any case where vector magnitude is not meaningful
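For intuition, cosine similarity is the dot product of two vectors divided by the product of their magnitudes. A minimal plain-TypeScript sketch of the computation (illustrative only, not part of the SDK):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Measures the angle between vectors; magnitude cancels out.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Vectors pointing the same direction score 1.0 regardless of magnitude.
cosineSimilarity([1, 0], [5, 0]); // 1
```

This magnitude-invariance is why cosine is the safe default for text embeddings, where vector length carries no meaning.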
2. Euclidean Distance
Best for: When absolute distance matters
How it works: Measures straight-line distance between vectors
Range: 0.0 to 1.0 (1.0 = closest, inverted for compatibility)
```typescript
const col = sv.collection('spatial-vectors', {
  dimensions: 128,
  metric: 'euclidean'
});
```
```python
col = sv.collection(
    "spatial-vectors",
    dimensions=128,
    metric="euclidean"
)
```
Use cases:
Image embeddings
Audio similarity
When vector magnitude is meaningful
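Raw Euclidean distance is unbounded; mapping it into the 0.0-1.0 range mentioned above is commonly done with a formula such as 1 / (1 + d). The exact mapping VecLabs applies is not specified here, so treat the score function in this sketch as an illustrative assumption:

```typescript
// Euclidean distance: sqrt(sum((a[i] - b[i])^2)).
function euclideanDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// One common way to invert a distance into a 0-1 similarity score.
// Assumption for illustration; the SDK's actual mapping may differ.
function euclideanScore(a: number[], b: number[]): number {
  return 1 / (1 + euclideanDistance(a, b));
}

euclideanDistance([0, 0], [3, 4]); // 5
```

Identical vectors have distance 0 and therefore score exactly 1.0; larger distances asymptotically approach 0.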
3. Dot Product
Best for: Pre-normalized vectors
How it works: Computes the dot product (sum of element-wise products)
Range: Unbounded (higher = more similar)
```typescript
const col = sv.collection('normalized-vectors', {
  dimensions: 768,
  metric: 'dot'
});
```
```python
col = sv.collection(
    "normalized-vectors",
    dimensions=768,
    metric="dot"
)
```
Dot product assumes vectors are normalized to unit length. If vectors are not normalized, results may be unexpected.
Use cases:
Pre-normalized embeddings
When you need maximum query speed
Matrix factorization embeddings
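On unit-length vectors the dot product equals cosine similarity while skipping the normalization step at query time, which is where its speed advantage comes from. A plain-TypeScript sketch (not part of the SDK):

```typescript
// Dot product: sum of element-wise products.
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

// Scale a vector to unit length before storing it, so dot product
// behaves like cosine similarity at query time.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return v.map((x) => x / norm);
}

dotProduct([1, 2], [3, 4]); // 11
```

If you skip normalization, long vectors dominate the scores regardless of direction, which is the "unexpected results" the caution above refers to.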
Metric Comparison
| Metric | Speed | Accuracy | Requires Normalization | Best For |
| --- | --- | --- | --- | --- |
| Cosine | Fast | High | No | General purpose, semantic search |
| Euclidean | Fast | High | No | Image/audio, spatial data |
| Dot Product | Fastest | High | Yes | Pre-normalized vectors |
Collection Statistics
Get detailed information about a collection:
```typescript
const stats = await col.describeIndexStats();
console.log(stats);
// {
//   vectorCount: 1542,
//   dimension: 1536,
//   metric: 'cosine',
//   name: 'agent-memory',
//   merkleRoot: 'a3f2c1e4...',
//   lastUpdated: 1678901234567,
//   isFrozen: false
// }
```
```python
stats = col.describe_index_stats()

print(stats.vector_count)  # 1542
print(stats.dimension)     # 1536
print(stats.metric)        # DistanceMetric.COSINE
print(stats.name)          # 'agent-memory'
print(stats.merkle_root)   # 'a3f2c1e4...'
print(stats.last_updated)  # 1678901234567
print(stats.is_frozen)     # False
```
Stats Fields
vectorCount: Total number of vectors in the collection
dimension: Vector dimension (fixed at collection creation)
metric: Distance metric used
name: Collection name
merkleRoot: Current Merkle root hash
lastUpdated: Timestamp of last update (milliseconds)
isFrozen: Whether the collection is frozen (coming soon)
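The merkleRoot field is what anchors a collection to Solana: a single hash that changes whenever any vector changes. VecLabs' actual leaf encoding and hashing scheme are not documented here, so the following is only a sketch of the general shape of a Merkle root computation:

```typescript
import { createHash } from 'crypto';

function sha256(data: string): string {
  return createHash('sha256').update(data).digest('hex');
}

// Illustrative Merkle root over a list of leaves (e.g. serialized vector
// entries). Assumption: the SDK's real leaf format and tree rules differ.
function merkleRoot(leaves: string[]): string {
  if (leaves.length === 0) return sha256('');
  let level = leaves.map((leaf) => sha256(leaf));
  while (level.length > 1) {
    const next: string[] = [];
    for (let i = 0; i < level.length; i += 2) {
      // Duplicate the last node when a level has an odd count.
      const right = level[i + 1] ?? level[i];
      next.push(sha256(level[i] + right));
    }
    level = next;
  }
  return level[0];
}
```

Any single-leaf change propagates up to a different root, which is why comparing the stored on-chain root against a recomputed one verifies collection integrity.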
Managing Multiple Collections
Use Case: Multiple Embedding Models
```typescript
const sv = new SolVec({ network: 'devnet' });

// Separate collections for different models
const openai = sv.collection('openai-embeddings', {
  dimensions: 1536,
  metric: 'cosine'
});

const cohere = sv.collection('cohere-embeddings', {
  dimensions: 1024,
  metric: 'cosine'
});

// Each collection is independent
await openai.upsert([{ id: 'doc1', values: [...] }]);
await cohere.upsert([{ id: 'doc1', values: [...] }]); // Same ID, different collection
```
Use Case: Development vs Production
```typescript
const sv = new SolVec({ network: 'devnet' });

// Development collection
const devCol = sv.collection('vectors-dev', { dimensions: 1536 });

// Production collection (same network, different namespace)
const prodCol = sv.collection('vectors-prod', { dimensions: 1536 });
```
Use Case: Multi-Tenant Application
```typescript
function getUserCollection(userId: string) {
  return sv.collection(`user-${userId}-memories`, {
    dimensions: 1536,
    metric: 'cosine'
  });
}

const aliceCol = getUserCollection('alice');
const bobCol = getUserCollection('bob');

// Each user has isolated vectors
await aliceCol.upsert([{ id: 'mem_1', values: [...] }]);
await bobCol.upsert([{ id: 'mem_1', values: [...] }]);
```
Each collection incurs separate storage and query costs, so avoid creating more collections than you need.
Listing Collections
```typescript
const collections = await sv.listCollections();
console.log(collections);
// ['agent-memory', 'user-profiles', 'documents']
```
```python
collections = sv.list_collections()
print(collections)
# ['agent-memory', 'user-profiles', 'documents']
```
This lists collections in the current session. Full on-chain collection discovery is coming in a future release.
Collection Naming Best Practices
Use descriptive names
user-memories, product-embeddings, knowledge-base. Not: col1, vectors, data
Include environment
vectors-prod, vectors-staging, vectors-dev
Include version for schema changes
embeddings-v1, embeddings-v2. Allows migration without downtime.
Keep names short
Collection names are stored on-chain. Shorter names = lower costs. Max: 64 characters
Use kebab-case
user-memories, product-catalog. Not: UserMemories, product_catalog
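The conventions above can be enforced with a small validator; this is a hypothetical helper (not part of the SDK), encoding kebab-case and the 64-character limit:

```typescript
// Hypothetical name check (not an SDK API): lowercase kebab-case,
// non-empty, at most 64 characters.
function isValidCollectionName(name: string): boolean {
  return (
    name.length > 0 &&
    name.length <= 64 &&
    /^[a-z0-9]+(-[a-z0-9]+)*$/.test(name)
  );
}

isValidCollectionName('user-memories'); // true
isValidCollectionName('UserMemories');  // false
```

Running such a check before creating a collection avoids paying on-chain storage for a name you would later want to migrate away from.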
Collection Lifecycle
1. Create
```typescript
const col = sv.collection('my-vectors', { dimensions: 1536 });
```
Collection is created lazily on first upsert.
2. Populate
```typescript
await col.upsert(vectors);
```
3. Query
```typescript
const results = await col.query({ vector: [...], topK: 5 });
```
4. Update
```typescript
// Upsert with existing IDs updates vectors
await col.upsert([{ id: 'existing_id', values: [...] }]);
```
5. Delete Vectors
```typescript
await col.delete(['id1', 'id2']);
```
6. Verify Integrity
```typescript
const verification = await col.verify();
console.log(verification.verified); // true
```
Advanced Patterns
Hybrid Collections
Store different content types in one collection:
```typescript
await col.upsert([
  {
    id: 'doc_1',
    values: [...],
    metadata: { type: 'document', title: '...' }
  },
  {
    id: 'img_1',
    values: [...],
    metadata: { type: 'image', url: '...' }
  },
  {
    id: 'msg_1',
    values: [...],
    metadata: { type: 'message', user: '...' }
  }
]);

// Filter by type in queries
const docs = await col.query({
  vector: [...],
  topK: 5,
  filter: { type: 'document' } // Alpha: filtering not yet implemented
});
```
Time-Partitioned Collections
```typescript
function getCollectionForDate(date: Date) {
  const month = date.toISOString().slice(0, 7); // '2024-03'
  return sv.collection(`vectors-${month}`, { dimensions: 1536 });
}

const marchCol = getCollectionForDate(new Date('2024-03-15'));
await marchCol.upsert(vectors);
Benefits:
Easier to delete old data
Improved query performance
Lower storage costs
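The flip side of partitioning is that a search spanning several months must fan out across partitions. A small pure helper can enumerate the monthly collection names covering a date range (a sketch, assuming the `vectors-YYYY-MM` naming used above):

```typescript
// List the monthly partition names covering [start, end], inclusive,
// matching the `vectors-YYYY-MM` convention from getCollectionForDate.
function partitionNames(start: Date, end: Date): string[] {
  const names: string[] = [];
  // Start from the first day of the starting month, in UTC.
  const cur = new Date(Date.UTC(start.getUTCFullYear(), start.getUTCMonth(), 1));
  while (cur.getTime() <= end.getTime()) {
    names.push(`vectors-${cur.toISOString().slice(0, 7)}`);
    cur.setUTCMonth(cur.getUTCMonth() + 1);
  }
  return names;
}

partitionNames(new Date('2024-01-15'), new Date('2024-03-02'));
// ['vectors-2024-01', 'vectors-2024-02', 'vectors-2024-03']
```

Each name can then be passed to sv.collection() and queried independently, with results merged client-side.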
Collection Aliases
```typescript
const PRIMARY_COLLECTION = 'vectors-v2';
const FALLBACK_COLLECTION = 'vectors-v1';

const primary = sv.collection(PRIMARY_COLLECTION, { dimensions: 1536 });
const fallback = sv.collection(FALLBACK_COLLECTION, { dimensions: 1536 });

const results = await primary.query({ vector: [...], topK: 5 });
if (results.matches.length === 0) {
  // Fall back to the old collection
  const oldResults = await fallback.query({ vector: [...], topK: 5 });
}
```
| Collection Size | Vectors | Query Time | Notes |
| --- | --- | --- | --- |
| Small | < 10K | < 5 ms | In-memory search |
| Medium | 10K - 100K | 5-20 ms | HNSW optimized |
| Large | 100K - 1M | 20-50 ms | Tune ef_search parameter |
| Huge | > 1M | 50-200 ms | Consider partitioning |
Next Steps
Verification Learn how to verify collection integrity on-chain
Performance Tuning Optimize query speed and recall for large collections
TypeScript Guide Complete TypeScript SDK reference
Python Guide Complete Python SDK reference