Overview
Azen uses semantic search to find memories based on meaning, not just keyword matching. This is powered by OpenAI embeddings and the Pinecone vector database.
How Semantic Search Works
Traditional keyword search finds exact word matches. Semantic search also understands:
- Synonyms (“car” matches “automobile”)
- Context (“bank” as a financial institution vs. a riverbank)
- Intent (“how to reset password” matches “password recovery steps”)
Vector Embeddings
What are Embeddings?
Embeddings are numerical representations of text in high-dimensional space, where similar meanings cluster together.
OpenAI text-embedding-3-small
Azen uses OpenAI’s text-embedding-3-small model (apps/api/src/lib/vector.ts:12-16):
- Dimensions: 1536
- Max Input: 8191 tokens
- Performance: ~62.3% on MTEB benchmark
- Cost: $0.02 per 1M tokens
- Speed: ~200ms for batch of 10 texts
text-embedding-3-small balances performance and cost. For higher accuracy, you can swap to text-embedding-3-large (3072 dimensions).
Embedding Generation Pipeline
Text Chunking
Large texts are split into chunks before embedding (apps/api/src/lib/chunk.ts:5-13):
- Max Tokens: 512 (smaller than model’s 8191 limit for better precision)
- Overlap: 50 tokens (maintains context between chunks)
- Tokenizer: GPT-4o encoding via js-tiktoken
Why Chunk?
- Better Search Precision: Small chunks are more focused
- Relevance Scoring: Each chunk can be scored independently
- Performance: Smaller vectors are faster to compute
- Context Preservation: Overlap prevents information loss at boundaries
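The chunking scheme above can be sketched roughly as follows. This is a simplified illustration that uses whitespace-separated words in place of real tokens (the actual pipeline counts tokens with js-tiktoken); `chunkText` is an illustrative name, not Azen’s actual export.

```typescript
// Simplified sketch of overlapping chunking. Whitespace-separated words
// stand in for tokens; the real pipeline counts tokens with js-tiktoken.
function chunkText(text: string, maxTokens = 512, overlap = 50): string[] {
  const tokens = text.split(/\s+/).filter(Boolean);
  if (tokens.length <= maxTokens) return [tokens.join(" ")];
  const chunks: string[] = [];
  const step = maxTokens - overlap; // each advance leaves `overlap` tokens shared
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + maxTokens).join(" "));
    if (start + maxTokens >= tokens.length) break; // final chunk reached
  }
  return chunks;
}
```

Because each chunk shares `overlap` tokens with its predecessor, a sentence that straddles a chunk boundary still appears whole in at least one chunk.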
Batch Processing
Embeddings are generated in batches (apps/api/src/jobs/embed-job.ts:20-22):
- Single memory with 10 chunks → 1 API call (not 10)
- OpenAI accepts up to 2048 inputs per request
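A minimal sketch of the batching logic: partition chunk texts into groups no larger than the provider’s per-request cap, so each group becomes a single embeddings API call. `batchInputs` is an illustrative name; 2048 is OpenAI’s documented per-request input limit.

```typescript
// Partition inputs into batches of at most `limit` items each, so a
// memory with 10 chunks needs a single API call rather than 10.
function batchInputs<T>(inputs: T[], limit = 2048): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < inputs.length; i += limit) {
    batches.push(inputs.slice(i, i + limit));
  }
  return batches;
}
```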
Vector Storage in Pinecone
Pinecone Configuration
Azen uses Pinecone as the vector database (apps/api/src/lib/vector.ts:2-9):
- Namespaces: Organization-level data isolation
- Metadata: Store memory ID and chunk index
- Similarity Metric: Cosine similarity (default)
Upserting Vectors
Vectors are uploaded with metadata (apps/api/src/lib/vector.ts:19-30):
Vector IDs follow the pattern {memoryId}::{chunkIndex}. This allows reconstruction of which chunks belong to which memory.
Namespace Strategy
Each organization gets a dedicated namespace:
- Data Isolation: Organizations can’t access each other’s vectors
- Performance: Smaller search space per query
- Compliance: Supports data residency and deletion requirements
- Scaling: Distribute vectors across namespace shards
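The {memoryId}::{chunkIndex} ID scheme described above can be captured by a pair of helpers. These are illustrative names, not the actual exports of vector.ts:

```typescript
// Build a vector ID of the form `{memoryId}::{chunkIndex}`.
function buildVectorId(memoryId: string, chunkIndex: number): string {
  return `${memoryId}::${chunkIndex}`;
}

// Recover the memory ID and chunk index from a vector ID.
function parseVectorId(id: string): { memoryId: string; chunkIndex: number } {
  const sep = id.lastIndexOf("::"); // assumes memoryId does not end with `::`
  return {
    memoryId: id.slice(0, sep),
    chunkIndex: Number(id.slice(sep + 2)),
  };
}
```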
Search Query Flow
The complete search flow is implemented in apps/api/src/routes/search.ts:18-93.
1. Receive Search Query
2. Embed the Query
3. Vector Similarity Search
The query embedding is searched against Pinecone (apps/api/src/lib/vector.ts:32-39):
- vector: Query embedding (1536 dimensions)
- topK: Number of results to return (default 5, max 50)
- includeMetadata: Set to false for performance (we only need IDs)
4. Extract Memory IDs
Chunk IDs are parsed to find unique memories.
5. Fetch from Database
6. Decrypt and Order Results
Fetched memories are decrypted and sorted back into relevance order (the memIds order).
7. Return Response
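Steps 4–6 hinge on collapsing ranked chunk hits back to unique memories while preserving relevance order. A hedged sketch — the match shape is assumed from Pinecone’s query response, and `uniqueMemoryIds` is an illustrative name:

```typescript
interface Match {
  id: string;    // vector ID: `{memoryId}::{chunkIndex}`
  score: number; // cosine similarity
}

// Collapse ranked chunk matches to unique memory IDs. Matches arrive
// sorted by score, so each memory keeps the rank of its best chunk.
function uniqueMemoryIds(matches: Match[]): string[] {
  const memIds: string[] = [];
  const seen = new Set<string>();
  for (const m of matches) {
    const memoryId = m.id.split("::")[0];
    if (!seen.has(memoryId)) {
      seen.add(memoryId);
      memIds.push(memoryId);
    }
  }
  return memIds;
}
```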
Similarity Scoring
Pinecone uses cosine similarity to measure vector closeness:
- 1.0: Identical vectors (perfect match)
- 0.9-1.0: Very similar (strong semantic match)
- 0.7-0.9: Moderately similar (related concepts)
- 0.0-0.7: Weakly similar or unrelated
- -1.0-0.0: Opposite meaning (rare with embeddings)
In practice, most relevant results score above 0.75. Scores below 0.6 are typically noise.
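For reference, cosine similarity is the dot product of two vectors divided by the product of their magnitudes. OpenAI embeddings are normalized to length 1, so for them the dot product alone gives the same result:

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Returns a value in [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```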
Vector Deletion
When a memory is deleted, its vectors must be removed (apps/api/src/lib/vector.ts:41-43). Deletion targets every chunk vector whose ID is derived from the memory’s memoryId.
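One way to remove every chunk vector for a memory is to enumerate its IDs from the known chunk count and pass them to a batch delete. A sketch under the assumption that the chunk count is stored alongside the memory; `vectorIdsForMemory` is an illustrative name:

```typescript
// Enumerate all vector IDs belonging to a memory so they can be
// passed to a batch delete call. Assumes the chunk count is known
// (e.g. stored in the database when the memory was embedded).
function vectorIdsForMemory(memoryId: string, chunkCount: number): string[] {
  return Array.from({ length: chunkCount }, (_, i) => `${memoryId}::${i}`);
}
```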
Performance Characteristics
Query Latency
| Operation | Latency |
|---|---|
| Embed query (1 text) | ~100-200ms |
| Pinecone search (topK=5) | ~50-150ms |
| Database fetch (5 memories) | ~10-50ms |
| Decrypt (5 memories) | ~1-5ms |
| Total | ~160-405ms |
Throughput
- Embedding: ~500 texts/second (batched)
- Pinecone Upsert: ~1000 vectors/second
- Pinecone Query: ~100 queries/second per namespace
Scaling Considerations
Pinecone Limits:
- Free tier: 100k vectors, 5 queries/second
- Paid tier: Unlimited vectors, configurable QPS
- Pod-based: Dedicated compute, ~10k QPS
Optimization Strategies:
- Cache Embeddings: Store query embeddings for common searches
- Reduce topK: Smaller result sets are faster
- Batch Queries: Process multiple searches in parallel
- Index Tuning: Adjust Pinecone pod type and replicas
Search Quality
Factors Affecting Relevance
- Chunk Size: Smaller chunks are more precise but may lose context
- Overlap: More overlap improves recall but increases storage
- Model Choice: Larger models (3-large) are more accurate
- topK Value: More results increase recall but add noise
Improving Search Results
Query Rewriting:
- Expand abbreviations (“ML” → “machine learning”)
- Add context (“reset password” → “how to reset user password”)
Hybrid Search:
- Combine vector search with keyword matching
- Use Pinecone metadata filters for structured constraints
Re-ranking:
- Apply a cross-encoder model to re-score results
- Filter by date, user preferences, or metadata
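The abbreviation-expansion step can be as simple as a lookup table applied to the query before embedding. A toy sketch — the table contents and `rewriteQuery` name are illustrative, not part of Azen:

```typescript
// Expand known abbreviations before embedding the query.
const ABBREVIATIONS: Record<string, string> = {
  ml: "machine learning",
  db: "database",
  auth: "authentication",
};

function rewriteQuery(query: string): string {
  return query
    .split(/\s+/)
    .map((word) => ABBREVIATIONS[word.toLowerCase()] ?? word)
    .join(" ");
}
```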
Privacy and Security
What OpenAI Sees
OpenAI receives plaintext content to generate embeddings:
- Memory content (not encrypted)
- Search queries (not encrypted)
OpenAI’s API Terms: Data submitted via API is not used to train models (as of 2024). Check their current data usage policy.
What Pinecone Sees
Pinecone stores:
- Vector embeddings (mathematical representations)
- Metadata (memory ID, chunk index)
Pinecone never receives the original text.
Threat Model
Protected:
- Embedding vectors are namespaced by organization
- Queries cannot access other organizations’ data
- Database enforces additional organization filtering
Not Protected:
- OpenAI sees plaintext during embedding generation
- Pinecone can infer topics from embedding patterns
- Embedding vectors could be reverse-engineered (difficult but possible)
Related Concepts
- Memory System - How embeddings are generated asynchronously
- Encryption - Why embeddings are not encrypted
- Organizations - How namespaces provide data isolation

