
Collections Guide

Collections in VecLabs are equivalent to Pinecone indexes. Each collection stores vectors of a fixed dimension and uses a specific distance metric for similarity search.

What is a Collection?

A collection is an isolated namespace for storing and querying vectors:
  • Fixed dimension: All vectors must have the same dimension
  • Single distance metric: Cosine, Euclidean, or Dot Product
  • Unique namespace: Collections are isolated from each other
  • On-chain verification: Each collection has a Merkle root stored on Solana
Think of collections as separate tables in a database. Each collection is optimized for a specific embedding model and use case.

Creating Collections

import { SolVec } from 'solvec';

const sv = new SolVec({ network: 'devnet' });

const col = sv.collection('my-collection', {
  dimensions: 1536,
  metric: 'cosine'
});
Collections are created lazily: the collection is initialized on the first upsert, not when collection() is called.

Dimension Configuration

Common Embedding Dimensions

  • OpenAI text-embedding-3-small: 1536 dimensions. Fast, cost-effective, high quality.
  • OpenAI text-embedding-3-large: 3072 dimensions. Highest accuracy for critical tasks.
  • Cohere embed-english-v3.0: 1024 dimensions. Optimized for semantic search.
  • Google Gemini embedding-001: 768 dimensions. Fast, free tier available.

Setting Dimensions

// OpenAI text-embedding-3-small
const col1 = sv.collection('openai-small', { dimensions: 1536 });

// OpenAI text-embedding-3-large
const col2 = sv.collection('openai-large', { dimensions: 3072 });

// Cohere embed-english-v3.0
const col3 = sv.collection('cohere', { dimensions: 1024 });

// Custom dimension
const col4 = sv.collection('custom', { dimensions: 384 });
The default dimension is 1536 (OpenAI text-embedding-3-small). Specify dimensions only if you use a different model.
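Because every vector in a collection must share one dimension, it can help to check vector lengths on the client before calling upsert. The following is a minimal sketch, not part of the solvec SDK: `assertDimensions` and `VectorRecord` are hypothetical names, and the record shape `{ id, values }` is assumed from the examples in this guide.

```typescript
// Hypothetical helper, not part of the solvec SDK.
interface VectorRecord {
  id: string;
  values: number[];
}

// Throws if any record's vector length differs from the collection dimension.
function assertDimensions(records: VectorRecord[], dimensions: number): void {
  for (const record of records) {
    if (record.values.length !== dimensions) {
      throw new Error(
        `Vector "${record.id}" has ${record.values.length} dimensions, expected ${dimensions}`
      );
    }
  }
}

// Example: a 384-dimension collection should only receive 384-length vectors.
const sampleRecord: VectorRecord = {
  id: 'a',
  values: new Array<number>(384).fill(0.1),
};
assertDimensions([sampleRecord], 384); // passes silently
```

Running a check like this before upsert surfaces a model/collection mismatch as a clear local error instead of a failed write.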

Distance Metrics

VecLabs supports three distance metrics:

1. Cosine Similarity (Default)

Best for: Most use cases, especially semantic search
How it works: Measures the angle between vectors, ignoring magnitude
Range: 0.0 to 1.0 (1.0 = identical direction)
const col = sv.collection('semantic-search', {
  dimensions: 1536,
  metric: 'cosine'
});
Use cases:
  • Document similarity
  • Semantic search
  • Recommendation systems
  • Any case where vector magnitude is not meaningful
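To make "ignoring magnitude" concrete, here is a small sketch of the textbook cosine formula. It is an illustration only, not VecLabs' internal implementation. Note that raw cosine similarity lies in [-1, 1]; how scores are mapped into the 0.0 to 1.0 range listed above is not specified in this guide.

```typescript
// Raw cosine similarity: dot(a, b) / (|a| * |b|). The result lies in [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same dimension');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine similarity is undefined for a zero vector');
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scaling a vector leaves the result unchanged (up to rounding),
// which is why cosine suits embeddings whose length carries no meaning.
const baseScore = cosineSimilarity([1, 2, 3], [2, 1, 0]);
const scaledScore = cosineSimilarity([10, 20, 30], [2, 1, 0]);
console.log(Math.abs(baseScore - scaledScore) < 1e-12);
```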

2. Euclidean Distance

Best for: When absolute distance matters
How it works: Measures straight-line distance between vectors
Range: 0.0 to 1.0 (1.0 = closest, inverted for compatibility)
const col = sv.collection('spatial-vectors', {
  dimensions: 128,
  metric: 'euclidean'
});
Use cases:
  • Image embeddings
  • Audio similarity
  • When vector magnitude is meaningful
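The sketch below shows the straight-line distance and one common way to invert it into a score where higher means closer. The `1 / (1 + d)` conversion is an assumption chosen for illustration; this guide does not state which inversion VecLabs actually applies.

```typescript
// Straight-line (L2) distance between two vectors of equal dimension.
function euclideanDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same dimension');
  }
  let sumOfSquares = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sumOfSquares += diff * diff;
  }
  return Math.sqrt(sumOfSquares);
}

// ASSUMPTION: one possible inversion into (0, 1], where 1.0 means identical.
// The formula VecLabs uses is not documented here.
function distanceToScore(distance: number): number {
  return 1 / (1 + distance);
}

const distance = euclideanDistance([0, 0], [3, 4]);
console.log(distance, distanceToScore(distance));
```

Unlike cosine, doubling the length of one vector changes the distance, so this metric is sensitive to magnitude.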

3. Dot Product

Best for: Pre-normalized vectors
How it works: Computes dot product (sum of element-wise multiplication)
Range: Unbounded (higher = more similar)
const col = sv.collection('normalized-vectors', {
  dimensions: 768,
  metric: 'dot'
});
Dot product assumes vectors are normalized to unit length. If they are not, scores are skewed toward vectors with larger magnitude, and rankings may not reflect semantic similarity.
Use cases:
  • Pre-normalized embeddings
  • When you need maximum query speed
  • Matrix factorization embeddings
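If your embedding model does not return unit-length vectors, you can normalize them yourself before upserting into a dot-product collection. This is a minimal sketch with hypothetical helper names (`normalize`, `dotProduct`); it is not part of the solvec SDK.

```typescript
// Scales a vector to unit length (L2 norm of 1).
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) {
    throw new Error('Cannot normalize a zero vector');
  }
  return v.map((x) => x / norm);
}

// Sum of element-wise products.
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same dimension');
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// After normalization, the dot product of two vectors equals their cosine similarity.
const unitA = normalize([3, 4]);
const unitB = normalize([4, 3]);
console.log(dotProduct(unitA, unitB));
```

Normalize query vectors the same way as stored vectors, otherwise scores are not comparable.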

Metric Comparison

| Metric | Speed | Accuracy | Requires Normalization | Best For |
| --- | --- | --- | --- | --- |
| Cosine | Fast | High | No | General purpose, semantic search |
| Euclidean | Fast | High | No | Image/audio, spatial data |
| Dot Product | Fastest | High | Yes | Pre-normalized vectors |

Collection Statistics

Get detailed information about a collection:
const stats = await col.describeIndexStats();

console.log(stats);
// {
//   vectorCount: 1542,
//   dimension: 1536,
//   metric: 'cosine',
//   name: 'agent-memory',
//   merkleRoot: 'a3f2c1e4...',
//   lastUpdated: 1678901234567,
//   isFrozen: false
// }

Stats Fields

  • vectorCount: Total number of vectors in the collection
  • dimension: Vector dimension (fixed at collection creation)
  • metric: Distance metric used
  • name: Collection name
  • merkleRoot: Current Merkle root hash
  • lastUpdated: Timestamp of last update (milliseconds)
  • isFrozen: Whether collection is frozen (coming soon)
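When working with these fields in TypeScript, a local type can make them easier to handle. The interface below is inferred from the sample output above and is an assumption, not the SDK's published type; `summarizeStats` is a hypothetical helper that also converts `lastUpdated` from milliseconds to an ISO timestamp.

```typescript
// Shape inferred from the sample output in this guide; not the SDK's official type.
interface CollectionStats {
  vectorCount: number;
  dimension: number;
  metric: string;
  name: string;
  merkleRoot: string;
  lastUpdated: number; // milliseconds since the Unix epoch
  isFrozen: boolean;
}

// Hypothetical helper: produces a one-line, human-readable summary.
function summarizeStats(stats: CollectionStats): string {
  const updated = new Date(stats.lastUpdated).toISOString();
  return `${stats.name}: ${stats.vectorCount} vectors, ${stats.dimension}d ${stats.metric}, updated ${updated}`;
}

const sampleStats: CollectionStats = {
  vectorCount: 1542,
  dimension: 1536,
  metric: 'cosine',
  name: 'agent-memory',
  merkleRoot: 'a3f2c1e4...',
  lastUpdated: 1678901234567,
  isFrozen: false,
};
console.log(summarizeStats(sampleStats));
```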

Managing Multiple Collections

Use Case: Multiple Embedding Models

const sv = new SolVec({ network: 'devnet' });

// Separate collections for different models
const openai = sv.collection('openai-embeddings', { 
  dimensions: 1536,
  metric: 'cosine'
});

const cohere = sv.collection('cohere-embeddings', { 
  dimensions: 1024,
  metric: 'cosine'
});

// Each collection is independent
await openai.upsert([{ id: 'doc1', values: [...] }]);
await cohere.upsert([{ id: 'doc1', values: [...] }]);  // Same ID, different collection

Use Case: Development vs Production

const sv = new SolVec({ network: 'devnet' });

// Development collection
const devCol = sv.collection('vectors-dev', { dimensions: 1536 });

// Production collection (same network, different namespace)
const prodCol = sv.collection('vectors-prod', { dimensions: 1536 });

Use Case: Multi-Tenant Application

function getUserCollection(userId: string) {
  return sv.collection(`user-${userId}-memories`, { 
    dimensions: 1536,
    metric: 'cosine'
  });
}

const aliceCol = getUserCollection('alice');
const bobCol = getUserCollection('bob');

// Each user has isolated vectors
await aliceCol.upsert([{ id: 'mem_1', values: [...] }]);
await bobCol.upsert([{ id: 'mem_1', values: [...] }]);
Each collection incurs separate storage and query costs, so create only the collections you need.

Listing Collections

const collections = await sv.listCollections();
console.log(collections);
// ['agent-memory', 'user-profiles', 'documents']
This lists collections in the current session. Full on-chain collection discovery is coming in a future release.

Collection Naming Best Practices

1. Use descriptive names
   user-memories, product-embeddings, knowledge-base
   Not: col1, vectors, data

2. Include environment
   vectors-prod, vectors-staging, vectors-dev

3. Include version for schema changes
   embeddings-v1, embeddings-v2
   Allows migration without downtime

4. Keep names short
   Collection names are stored on-chain. Shorter names = lower costs.
   Max: 64 characters

5. Use kebab-case
   user-memories, product-catalog
   Not: UserMemories, product_catalog
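The mechanical parts of these naming rules (kebab-case and the 64-character limit) can be checked in code. The sketch below is a hypothetical client-side validator, not an SDK function, and the exact character set VecLabs accepts is an assumption. Descriptiveness cannot be checked automatically, so a name like col1 still passes.

```typescript
// Limit stated in this guide.
const MAX_COLLECTION_NAME_LENGTH = 64;

// ASSUMPTION: kebab-case means lowercase letters and digits separated by single hyphens.
const KEBAB_CASE_PATTERN = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

// Hypothetical helper, not part of the solvec SDK.
function isValidCollectionName(name: string): boolean {
  return name.length <= MAX_COLLECTION_NAME_LENGTH && KEBAB_CASE_PATTERN.test(name);
}

console.log(isValidCollectionName('user-memories'));   // true
console.log(isValidCollectionName('UserMemories'));    // false
console.log(isValidCollectionName('product_catalog')); // false
```

This is especially useful when names are built from user input, as in the multi-tenant pattern.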

Collection Lifecycle

1. Create

const col = sv.collection('my-vectors', { dimensions: 1536 });
The collection is created lazily on the first upsert.

2. Populate

await col.upsert(vectors);

3. Query

const results = await col.query({ vector: [...], topK: 5 });

4. Update

// Upsert with existing IDs updates vectors
await col.upsert([{ id: 'existing_id', values: [...] }]);

5. Delete Vectors

await col.delete(['id1', 'id2']);

6. Verify Integrity

const verification = await col.verify();
console.log(verification.verified); // true
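For intuition about what a Merkle root commits to, here is a generic sketch that hashes a list of strings (for example, vector IDs) into a single root using Node's built-in crypto module. Everything about the scheme here (SHA-256, hex concatenation, duplicating the last node on odd levels, the empty-list value) is an assumption for illustration. It is not VecLabs' actual on-chain format and will not reproduce the merkleRoot returned by the SDK.

```typescript
import { createHash } from 'node:crypto';

function sha256Hex(data: string): string {
  return createHash('sha256').update(data).digest('hex');
}

// Illustrative Merkle root over leaf strings.
// ASSUMPTIONS: leaf encoding, hash function, and odd-node handling are all
// chosen for this example and may differ from what VecLabs uses.
function merkleRoot(leaves: string[]): string {
  if (leaves.length === 0) {
    return sha256Hex('');
  }
  let level = leaves.map((leaf) => sha256Hex(leaf));
  while (level.length > 1) {
    const next: string[] = [];
    for (let i = 0; i < level.length; i += 2) {
      const left = level[i];
      // With an odd number of nodes, pair the last node with itself.
      const right = i + 1 < level.length ? level[i + 1] : left;
      next.push(sha256Hex(left + right));
    }
    level = next;
  }
  return level[0];
}

// Changing any leaf changes the root, which is what makes tampering detectable.
console.log(merkleRoot(['id1', 'id2', 'id3']) !== merkleRoot(['id1', 'id2', 'idX']));
```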

Advanced Patterns

Hybrid Collections

Store different content types in one collection:
await col.upsert([
  { 
    id: 'doc_1', 
    values: [...], 
    metadata: { type: 'document', title: '...' } 
  },
  { 
    id: 'img_1', 
    values: [...], 
    metadata: { type: 'image', url: '...' } 
  },
  { 
    id: 'msg_1', 
    values: [...], 
    metadata: { type: 'message', user: '...' } 
  }
]);

// Filter by type in queries
const docs = await col.query({
  vector: [...],
  topK: 5,
  filter: { type: 'document' }  // Alpha: filtering not yet implemented
});

Time-Partitioned Collections

function getCollectionForDate(date: Date) {
  const month = date.toISOString().slice(0, 7); // '2024-03'
  return sv.collection(`vectors-${month}`, { dimensions: 1536 });
}

const marchCol = getCollectionForDate(new Date('2024-03-15'));
await marchCol.upsert(vectors);
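Benefits:
  • Easier to delete old data: old months live in their own collections
  • Improved query performance when a query only needs to search one month's partition
  • Lower storage costs once old partitions are no longer kept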
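When a query or cleanup job spans several months, you need the list of partition names in that range. The helper below is a hypothetical sketch that follows the `vectors-YYYY-MM` naming used above; it is plain date arithmetic and does not call the SDK.

```typescript
// Returns one collection name per month from start to end (inclusive, in UTC),
// for example 'vectors-2024-03'. Assumes four-digit years.
function monthlyCollectionNames(start: Date, end: Date, prefix = 'vectors'): string[] {
  const names: string[] = [];
  let year = start.getUTCFullYear();
  let month = start.getUTCMonth(); // 0-based: 0 = January
  const endYear = end.getUTCFullYear();
  const endMonth = end.getUTCMonth();

  while (year < endYear || (year === endYear && month <= endMonth)) {
    names.push(`${prefix}-${year}-${String(month + 1).padStart(2, '0')}`);
    month += 1;
    if (month === 12) {
      month = 0;
      year += 1;
    }
  }
  return names;
}

console.log(monthlyCollectionNames(new Date('2024-01-10'), new Date('2024-03-15')));
// ['vectors-2024-01', 'vectors-2024-02', 'vectors-2024-03']
```

Each returned name can then be passed to sv.collection() to query or clean up that partition.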

Collection Aliases

const PRIMARY_COLLECTION = 'vectors-v2';
const FALLBACK_COLLECTION = 'vectors-v1';

const primary = sv.collection(PRIMARY_COLLECTION, { dimensions: 1536 });
const fallback = sv.collection(FALLBACK_COLLECTION, { dimensions: 1536 });

let results = await primary.query({ vector: [...], topK: 5 });
if (results.matches.length === 0) {
  // Fall back to the old collection when the primary returns no matches
  results = await fallback.query({ vector: [...], topK: 5 });
}
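The fallback logic can be wrapped in a reusable function so that callers always receive one result object. This is a sketch under assumptions: the `Queryable` interface and the `id` and `score` fields on each match are hypothetical; only `query({ vector, topK })` and `results.matches` come from the examples in this guide.

```typescript
// Minimal structural types; the match fields are assumptions for illustration.
interface QueryRequest {
  vector: number[];
  topK: number;
}
interface QueryMatch {
  id: string;
  score: number;
}
interface QueryResponse {
  matches: QueryMatch[];
}
interface Queryable {
  query(request: QueryRequest): Promise<QueryResponse>;
}

// Queries the primary collection first and only hits the fallback
// when the primary returns no matches.
async function queryWithFallback(
  primary: Queryable,
  fallback: Queryable,
  request: QueryRequest
): Promise<QueryResponse> {
  const results = await primary.query(request);
  if (results.matches.length > 0) {
    return results;
  }
  return fallback.query(request);
}
```

Any object with a compatible query method can be passed in, including in-memory stubs, which keeps the helper easy to test without a network connection.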

Performance Considerations

  • Small collections (< 10K vectors): query time < 5ms. In-memory search.
  • Medium collections (10K - 100K vectors): query time 5-20ms. HNSW optimized.
  • Large collections (100K - 1M vectors): query time 20-50ms. Tune the ef_search parameter.
  • Huge collections (> 1M vectors): query time 50-200ms. Consider partitioning.
For collections > 100K vectors, see the Performance Tuning Guide for optimization strategies.

Next Steps

Verification

Learn how to verify collection integrity on-chain

Performance Tuning

Optimize query speed and recall for large collections

TypeScript Guide

Complete TypeScript SDK reference

Python Guide

Complete Python SDK reference
