Supported databases
- ChromaDB: local embedded database, auto-provisioned via Docker
- Weaviate: remote or local, supports GraphQL and REST
- Qdrant: high-performance vector search (coming soon)
- Pinecone: managed cloud service (planned)
ChromaDB (local)
ChromaDB is automatically provisioned when creating a local dataset.
Auto-provisioning
When you create a dataset with type `local_file`, the system does the following (a request sketch follows the list):
- Pulls the `chromadb/chroma:latest` Docker image
- Starts a container with a unique name and port
- Creates a persistent volume for data
- Monitors container health
- Stores connection info in the database
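For illustration, a minimal sketch of creating a local dataset over the REST API. The `POST /api/v1/datasets` endpoint, base URL, and payload fields are assumptions extrapolated from the documented `GET /api/v1/datasets/{name}` endpoint, not confirmed API.

```python
import requests

# Hypothetical request: endpoint, base URL, and payload fields are
# assumptions based on the documented GET /api/v1/datasets/{name} endpoint.
resp = requests.post(
    "http://localhost:8080/api/v1/datasets",
    json={"name": "my-docs", "type": "local_file"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # response should reflect the provisioner state, e.g. "pending"
```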
Manual connection
Connect to an existing ChromaDB instance:
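A minimal sketch using the official ChromaDB Python client; the host, port, and collection name are placeholders for your deployment.

```python
import chromadb

# Connect to a ChromaDB server that is already running
# (host and port are placeholders for your deployment).
client = chromadb.HttpClient(host="localhost", port=8000)
client.heartbeat()  # raises if the server is unreachable

collection = client.get_or_create_collection("documents")
```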
Container lifecycle
Provisioner states:
- `pending`: Provisioning requested
- `running`: Container active and healthy
- `stopped`: Container stopped
- `failed`: Provisioning failed
- `cleanup`: Cleanup in progress
Resource management
ChromaDB containers use:
- CPU: 0.5 cores (soft limit)
- Memory: 512MB (soft limit)
- Storage: Persistent Docker volume
- Ports: Automatically allocated (8000+)
Check provisioner status via the `GET /api/v1/datasets/{name}` endpoint, as in the example below.
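For example, polling the documented endpoint with `requests`; the base URL and the response field name are assumptions for this sketch.

```python
import requests

# Poll the documented status endpoint; the base URL and the exact
# response shape are assumptions for this sketch.
resp = requests.get("http://localhost:8080/api/v1/datasets/my-docs", timeout=10)
resp.raise_for_status()
print(resp.json().get("provisioner_state"))  # e.g. "running"
```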
Weaviate (remote)
Connect to an existing Weaviate instance (local or cloud).
Configuration
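A minimal connection sketch with the Weaviate Python client (v3-style API); the URL is a placeholder for your instance.

```python
import weaviate

# Connect to an existing Weaviate instance (URL is a placeholder).
client = weaviate.Client("http://localhost:8080")
assert client.is_ready()  # True once the instance is reachable
```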
Weaviate cloud
Using Weaviate Cloud Services (WCS): sign up at Weaviate Cloud and create a cluster.
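Once the cluster exists, connect with its URL and an API key. A v3-style sketch; the cluster URL and key are placeholders for your own credentials.

```python
import weaviate

# Connect to a WCS cluster; URL and API key are placeholders.
client = weaviate.Client(
    url="https://my-cluster.weaviate.network",
    auth_client_secret=weaviate.AuthApiKey(api_key="YOUR-WCS-API-KEY"),
)
assert client.is_ready()
```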
Local Weaviate
Run Weaviate locally with Docker, for example: `docker run -p 8080:8080 cr.weaviate.io/semitechnologies/weaviate:latest`.
Qdrant (coming soon)
Qdrant support is planned for a future release. Planned features:
- Auto-provisioning like ChromaDB
- Quantization for reduced memory
- Payload indexing for filtering
- Batch operations for performance
Embedding models
Vector databases require embeddings for semantic search.
OpenAI embeddings
Used with ChromaDB (example below):
- `text-embedding-3-small`: 1536 dimensions, low cost
- `text-embedding-3-large`: 3072 dimensions, high quality
- `text-embedding-ada-002`: legacy, 1536 dimensions
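For example, generating an embedding with the OpenAI Python client (reads `OPENAI_API_KEY` from the environment):

```python
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="What is vector search?",
)
vector = resp.data[0].embedding
print(len(vector))  # 1536 for text-embedding-3-small
```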
Weaviate vectorizers
Weaviate can use built-in vectorizers (schema sketch below):
- `text2vec-openai`: OpenAI embeddings
- `text2vec-cohere`: Cohere embeddings
- `text2vec-huggingface`: Hugging Face Inference API models
- `text2vec-transformers`: local transformer models
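For example, a class can be created with a vectorizer via the v3 client's schema API; the class and property names here are illustrative.

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Objects of this class are vectorized server-side by the
# text2vec-openai module (class name is illustrative).
client.schema.create_class({
    "class": "Document",
    "vectorizer": "text2vec-openai",
    "properties": [{"name": "content", "dataType": ["text"]}],
})
```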
Document ingestion
Documents are processed and embedded during ingestion:
- Extract text from files (PDF, DOCX, TXT, etc.)
- Split into overlapping chunks
- Generate embeddings for each chunk
- Store in the vector database with metadata
Chunking strategy
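A sketch of the overlapping-chunk split; the chunk size and overlap values are illustrative, not the system's actual defaults.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    # Character-based overlapping chunks; size and overlap are illustrative.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```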
Supported file formats
- Text: .txt, .md, .csv
- Documents: .pdf, .docx, .odt
- Code: .py, .js, .java, .cpp
- Data: .json, .xml, .yaml
Search and retrieval
Vector databases enable semantic search.
Similarity search
Query parameters (see the example below):
- `similarity_threshold`: minimum similarity score (0.0-1.0)
- `limit`: maximum number of results to return
- `include_metadata`: include document metadata
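For illustration, a query using these parameters. Only the three parameters come from this page; the `POST /api/v1/datasets/{name}/search` endpoint, base URL, and response shape are assumptions.

```python
import requests

# Hypothetical search endpoint; the three parameters mirror the
# options documented above.
resp = requests.post(
    "http://localhost:8080/api/v1/datasets/my-docs/search",
    json={
        "query": "how do I rotate credentials?",
        "similarity_threshold": 0.75,
        "limit": 5,
        "include_metadata": True,
    },
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json()["results"]:  # response shape is an assumption
    print(hit)
```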
Hybrid search
Combine vector search with keyword filtering:
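One way to express this with the ChromaDB client: combine a vector query with metadata and keyword filters. Collection, field names, and values are illustrative.

```python
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection("documents")

# Vector query combined with metadata and keyword filters
# (field names and values are illustrative).
results = collection.query(
    query_texts=["deployment checklist"],
    n_results=5,
    where={"source": "handbook"},                # metadata filter
    where_document={"$contains": "kubernetes"},  # keyword filter
)
```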
Performance optimization
Indexing
ChromaDB:
- HNSW (Hierarchical Navigable Small World) index
- Automatic index building
- In-memory for fast queries
Weaviate:
- HNSW index with configurable parameters
- Disk-based with caching
- Quantization for memory efficiency
Batch operations
Ingest documents in batches:
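A sketch of batched ingestion with the ChromaDB client; the batch size and placeholder corpus are illustrative.

```python
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection("documents")

docs = [f"document body {i}" for i in range(1000)]  # placeholder corpus
BATCH = 100  # illustrative batch size

# Add documents in fixed-size batches instead of one call per document.
for start in range(0, len(docs), BATCH):
    batch = docs[start:start + BATCH]
    collection.add(
        ids=[str(start + i) for i in range(len(batch))],
        documents=batch,
    )
```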
Caching
Cache frequently accessed results (sketch after this list):
- Query result caching (5-minute TTL)
- Embedding caching for common queries
- Metadata caching
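A sketch of query-result caching with `cachetools`; the 300-second TTL matches the 5-minute TTL above, while the cache size is illustrative.

```python
from cachetools import TTLCache, cached

# Query results expire after 300 s (the 5-minute TTL above);
# maxsize is illustrative.
query_cache = TTLCache(maxsize=1024, ttl=300)

@cached(query_cache)
def search(query: str, limit: int = 5):
    return []  # call the vector database here
```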
Monitoring
Track database health and performance:
Health checks
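For example, a ChromaDB health check via the client's heartbeat call; host and port are placeholders.

```python
import chromadb

def chroma_healthy(host: str = "localhost", port: int = 8000) -> bool:
    # heartbeat() raises if the server is unreachable.
    try:
        chromadb.HttpClient(host=host, port=port).heartbeat()
        return True
    except Exception:
        return False
```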
Metrics
Track in application logs:
- Query latency (p50, p95, p99)
- Index size and growth
- Memory usage
- Document count
- Failed queries
Backup and restore
ChromaDB
Back up the Docker volume, for example: `docker run --rm -v <volume>:/data -v "$PWD":/backup alpine tar czf /backup/chroma-backup.tar.gz -C /data .` (substitute the dataset's volume name for `<volume>`).
Weaviate
Use the Weaviate backup module:
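With the v3 Python client, a backup can be triggered to a configured backend. The backup id and `filesystem` backend here are illustrative, and the corresponding backup module must be enabled on the Weaviate server.

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Requires a backup backend module (e.g. backup-filesystem) enabled
# on the server; backup_id and backend are illustrative.
client.backup.create(
    backup_id="nightly-2024-01-01",
    backend="filesystem",
    wait_for_completion=True,
)
```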
Troubleshooting
Container won't start
Check the Docker logs with `docker logs <container-name>`. Common issues:
- Port already in use
- Insufficient memory
- Docker daemon not running
Slow queries
- Reduce the `limit` parameter
- Increase `similarity_threshold`
- Check database size (reindex if needed)
- Monitor container resources
Connection errors
- Verify network connectivity
- Check firewall rules
- Validate API keys/credentials
- Ensure service is running