Production RAG pipelines need cost controls. VectorDB provides cost-optimized strategies that reduce API calls, minimize token usage, and enable budget-aware retrieval without sacrificing quality.

Cost breakdown

Typical RAG pipeline costs:
Query embedding:     $0.0001  (API-based embedder)
Sparse embedding:    $0.0000  (local TF-IDF)
Vector search:       $0.0000  (included in DB pricing)
Reranking (API):     $0.0020  (per 1000 docs)
LLM generation:      $0.0050  (per 1000 tokens)
─────────────────────────────
Total per query:     ~$0.0071
At 1M queries/month: $7,100/month
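The line items above can be folded into a tiny cost model (the rates are the illustrative ones from this table, not live pricing):

```python
# Illustrative per-query cost model using the example rates above.
RATES = {
    "query_embedding": 0.0001,   # API-based embedder, per query
    "sparse_embedding": 0.0,     # local TF-IDF
    "vector_search": 0.0,        # included in DB pricing
    "reranking": 0.0020,         # per query (1000 docs)
    "generation": 0.0050,        # per query (~1000 tokens)
}

def monthly_cost(queries_per_month: int) -> float:
    per_query = sum(RATES.values())
    return per_query * queries_per_month

per_query = sum(RATES.values())    # ~0.0071
monthly = monthly_cost(1_000_000)  # ~7100.0
```

Swapping any of these rates (e.g., disabling reranking) immediately shows its monthly impact.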

Cost-optimized strategies

1. Hybrid search with local sparse embeddings

Use local TF-IDF for sparse embeddings instead of API-based models:
# Hybrid pipeline: dense embedding via API, sparse embedding via local TF-IDF
from vectordb.haystack.hybrid_indexing import MilvusHybridSearchPipeline

pipeline = MilvusHybridSearchPipeline("configs/milvus_triviaqa.yaml")
result = pipeline.run(query="machine learning", top_k=10)
Savings:
  • Dense embedding: $0.0001 (API)
  • Sparse embedding: $0.0000 (local)
  • 50% reduction in embedding costs

2. Optional generation

Allow retrieval-only mode to skip LLM generation when not needed:
from vectordb.langchain.cost_optimized_rag.search import (
    ChromaCostOptimizedRAGSearchPipeline
)

pipeline = ChromaCostOptimizedRAGSearchPipeline("config.yaml")

# Retrieval only (no LLM cost)
result = pipeline.search(query="photosynthesis", top_k=5)
print(result["documents"])  # Just documents, no answer

# With generation (LLM cost incurred)
if user_needs_answer:
    answer = llm.generate(query, result["documents"])
Savings:
  • Skip generation for 60% of queries (users just browse documents)
  • 60% reduction in generation costs

3. Batch processing

Embed and search multiple queries in a single batch:
from vectordb.langchain.utils import EmbedderHelper

embedder = EmbedderHelper.create_embedder(config)  # config loaded elsewhere

# Batch embedding (lower API cost)
queries = [
    "What is AI?",
    "Explain neural networks",
    "How does backpropagation work?"
]

# Single API call for all queries
embeddings = embedder.embed_documents(queries)

# Batch search
results = [
    vector_db.search(emb, top_k=10)
    for emb in embeddings
]
Savings:
  • Batch embedding: 70% cheaper than individual calls
  • Reduced API overhead
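Back-of-envelope, using the illustrative $0.0001 per-call rate from the cost breakdown and the 70% batch discount claimed above:

```python
# Illustrative arithmetic for the batch-embedding discount.
PER_CALL = 0.0001      # example API rate per individual embedding call
BATCH_DISCOUNT = 0.70  # batched calls cost 70% less per query

def embedding_cost(num_queries: int, batched: bool) -> float:
    rate = PER_CALL * (1 - BATCH_DISCOUNT) if batched else PER_CALL
    return num_queries * rate

individual = embedding_cost(10_000, batched=False)  # ~1.00
batched = embedding_cost(10_000, batched=True)      # ~0.30
```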

4. Result caching

Cache frequent queries to avoid repeated searches:
class CachedRAGPipeline:
    def __init__(self, pipeline, maxsize=1000):
        self.pipeline = pipeline
        self.cache = {}
        self.maxsize = maxsize

    def search(self, query: str, top_k: int = 10):
        cache_key = (query, top_k)

        if cache_key in self.cache:
            return self.cache[cache_key]

        result = self.pipeline.search(query, top_k)
        if len(self.cache) >= self.maxsize:
            self.cache.pop(next(iter(self.cache)))  # evict oldest entry
        self.cache[cache_key] = result
        return result
Savings:
  • 30-40% cache hit rate for typical applications
  • 30-40% reduction in total costs

5. Pre-filtering before retrieval

Narrow the search space with metadata filters to reduce the number of documents processed downstream:
# Unfiltered search (processes all documents)
results = pipeline.search("machine learning", top_k=100)

# Pre-filtered search (smaller search space)
results = pipeline.search(
    "machine learning",
    top_k=10,
    filters={
        "category": "technology",
        "date": {"$gte": "2023-01-01"}
    }
)
Savings:
  • Smaller result sets reduce reranking costs
  • Faster searches reduce compute costs

Configuration

Cost-optimized setup
pinecone:
  api_key: ${PINECONE_API_KEY}
  index_name: cost-optimized
  namespace: default
  dimension: 384

embedding:
  provider: sentence_transformers  # Local, zero cost
  model: all-MiniLM-L6-v2

search:
  rrf_k: 60
  cache_enabled: true
  cache_ttl: 3600  # 1 hour

llm:
  provider: groq  # Cost-effective generation
  model: llama-3.3-70b-versatile
  api_key: ${GROQ_API_KEY}
  optional: true  # Allow skipping generation

reranking:
  enabled: false  # Disable for cost savings
  # Or use local cross-encoder:
  # enabled: true
  # type: cross_encoder
  # model: BAAI/bge-reranker-base
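A minimal sketch of how a pipeline might honor the `llm.optional` and `reranking.enabled` flags above. The config shape is the YAML shown; the gating logic is illustrative, not the library's actual code:

```python
# Illustrative gating on the config flags above (not the library's code).
config = {
    "llm": {"provider": "groq", "optional": True},
    "reranking": {"enabled": False},
}

def should_rerank(config: dict) -> bool:
    # reranking.enabled: false skips the API reranker entirely
    return config.get("reranking", {}).get("enabled", False)

def should_generate(config: dict, user_wants_answer: bool) -> bool:
    # llm.optional: true means generation runs only when requested
    if config.get("llm", {}).get("optional", False):
        return user_wants_answer
    return True
```

With the config above, no reranking runs and generation is incurred only when the caller asks for an answer.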

Implementation: Pinecone cost-optimized pipeline

Here’s how the LangChain cost-optimized pipeline reduces costs:
from vectordb.databases.pinecone import PineconeVectorDB
from vectordb.langchain.utils import (
    ConfigLoader,
    EmbedderHelper,
    SparseEmbedder,
    ResultMerger,
    RAGHelper
)

class PineconeCostOptimizedRAGSearchPipeline:
    def __init__(self, config_or_path):
        self.config = ConfigLoader.load(config_or_path)
        
        # Dense embedder (API-based, required)
        self.dense_embedder = EmbedderHelper.create_embedder(self.config)
        
        # Sparse embedder (local, zero cost)
        self.sparse_embedder = SparseEmbedder()
        
        # Pinecone connection
        self.db = PineconeVectorDB(
            api_key=self.config["pinecone"]["api_key"],
            index_name=self.config["pinecone"]["index_name"]
        )
        self.dimension = self.config["pinecone"]["dimension"]
        
        # Optional LLM (can be disabled to save cost)
        self.llm = RAGHelper.create_llm(self.config)
    
    def search(self, query, top_k=10, filters=None):
        # Generate embeddings
        # Dense: 1 API call
        dense_query_embedding = EmbedderHelper.embed_query(
            self.dense_embedder, query
        )
        # Sparse: local, zero cost
        sparse_query_embedding = self.sparse_embedder.embed_query(query)
        
        # Execute dual searches
        dense_docs = self.db.query(
            vector=dense_query_embedding,
            top_k=top_k,
            filter=filters
        )
        
        sparse_docs = self.db.query_with_sparse(
            vector=[0.0] * self.dimension,  # Placeholder
            sparse_vector=sparse_query_embedding,
            top_k=top_k,
            filter=filters
        )
        
        # Fuse results (local RRF, zero cost)
        merged = ResultMerger.merge_and_deduplicate(
            [dense_docs, sparse_docs],
            method="rrf",
            weights=[0.5, 0.5]
        )
        
        result = {"documents": merged[:top_k], "query": query}
        
        # Optional generation (controlled by config)
        if self.llm is not None:
            answer = RAGHelper.generate(self.llm, query, merged[:top_k])
            result["answer"] = answer
        
        return result

Local sparse embedding

The SparseEmbedder uses TF-IDF locally:
from sklearn.feature_extraction.text import TfidfVectorizer

class SparseEmbedder:
    def __init__(self, max_features=5000):
        self.vectorizer = TfidfVectorizer(
            max_features=max_features,
            stop_words="english"
        )
    
    def embed_query(self, query: str) -> dict:
        # Transform with the already-fitted vocabulary (fit the corpus via
        # embed_documents first) so query indices match document indices
        tfidf_matrix = self.vectorizer.transform([query])

        # Convert to sparse vector format for Pinecone
        indices = tfidf_matrix.indices.tolist()
        values = tfidf_matrix.data.tolist()

        return {"indices": indices, "values": values}

    def embed_documents(self, texts: list[str]) -> list[dict]:
        # Fit the vocabulary on the corpus, then transform in one pass
        tfidf_matrix = self.vectorizer.fit_transform(texts)
        
        sparse_vectors = []
        for i in range(len(texts)):
            row = tfidf_matrix[i]
            sparse_vectors.append({
                "indices": row.indices.tolist(),
                "values": row.data.tolist()
            })
        
        return sparse_vectors
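The crucial detail with local TF-IDF is fit order: the vectorizer must be fitted on the document corpus before queries are transformed, or query indices will not line up with the indexed document vectors. A self-contained sketch using scikit-learn directly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "machine learning trains models on data",
    "photosynthesis converts light into chemical energy",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(corpus)  # fit vocabulary on documents

# Transform the query with the SAME fitted vocabulary -- no refitting --
# so its indices refer to the same terms as the document vectors.
query_vec = vectorizer.transform(["machine learning models"])

sparse = {
    "indices": query_vec.indices.tolist(),
    "values": query_vec.data.tolist(),
}
```

Refitting per query (as `fit_transform` on a single string would do) rebuilds the vocabulary from scratch, producing indices that are meaningless against the index.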

Cost monitoring

Track API usage and estimated costs:
class CostTracker:
    def __init__(self):
        self.embedding_calls = 0
        self.reranking_calls = 0
        self.generation_tokens = 0
    
    def track_embedding(self, num_queries=1):
        self.embedding_calls += num_queries
    
    def track_reranking(self, num_docs=0):
        self.reranking_calls += num_docs
    
    def track_generation(self, tokens=0):
        self.generation_tokens += tokens
    
    def estimate_cost(self):
        # Pricing (example rates)
        embedding_cost = self.embedding_calls * 0.0001
        reranking_cost = (self.reranking_calls / 1000) * 2.0
        generation_cost = (self.generation_tokens / 1000) * 0.005
        
        return {
            "embedding": embedding_cost,
            "reranking": reranking_cost,
            "generation": generation_cost,
            "total": embedding_cost + reranking_cost + generation_cost
        }

# Usage
tracker = CostTracker()

for query in queries:
    tracker.track_embedding()
    result = pipeline.search(query)
    tracker.track_generation(len(result["answer"].split()))

print(tracker.estimate_cost())
Example output (e.g., 100 queries averaging 250 generated tokens each):
{
    "embedding": 0.010,
    "reranking": 0.000,
    "generation": 0.125,
    "total": 0.135
}

Comparison: standard vs. cost-optimized

Standard pipeline, per query:
  • Dense embedding (API): $0.0001
  • Sparse embedding (API): $0.0001
  • Reranking (Cohere): $0.0020
  • Generation (GPT-4): $0.0150
Total: $0.0172 per query
At 1M queries/month: $17,200/month

The strategies above replace the API sparse embedder with local TF-IDF, disable or localize reranking, and skip generation when users only need documents, cutting most of these line items to zero.

Budget controls

Implement hard limits on costs:
from datetime import datetime

class BudgetExceededError(Exception):
    """Raised when a query would push spend past the daily budget."""

class BudgetController:
    def __init__(self, daily_budget=100.0):
        self.daily_budget = daily_budget
        self.daily_spend = 0.0
        self.last_reset = datetime.now().date()
    
    def check_budget(self, estimated_cost):
        # Reset counter if new day
        if datetime.now().date() > self.last_reset:
            self.daily_spend = 0.0
            self.last_reset = datetime.now().date()
        
        # Check if query would exceed budget
        if self.daily_spend + estimated_cost > self.daily_budget:
            raise BudgetExceededError(
                f"Daily budget ${self.daily_budget} would be exceeded"
            )
        
        self.daily_spend += estimated_cost
    
    def get_remaining_budget(self):
        return self.daily_budget - self.daily_spend

# Usage
budget = BudgetController(daily_budget=50.0)

try:
    budget.check_budget(0.01)  # Estimated cost for this query
    result = pipeline.search(query)
except BudgetExceededError:
    # Return cached result or error message
    result = {"error": "Daily budget exceeded"}

Best practices

Use local embedders

SentenceTransformers models run locally with zero API cost. Quality is comparable to API embedders for most use cases.

Cache aggressively

30-40% of queries are repeats. LRU cache with 1-hour TTL reduces costs significantly.
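A minimal TTL-aware cache sketch matching the `cache_ttl: 3600` setting in the config (illustrative; not the library's built-in cache):

```python
import time

class TTLCache:
    """Small insertion-order cache with per-entry TTL (illustrative)."""

    def __init__(self, maxsize=1000, ttl=3600.0):
        self.maxsize = maxsize
        self.ttl = ttl
        self._store = {}  # key -> (inserted_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        inserted_at, value = entry
        if time.monotonic() - inserted_at > self.ttl:
            del self._store[key]  # expired: drop and report a miss
            return None
        return value

    def put(self, key, value):
        if key not in self._store and len(self._store) >= self.maxsize:
            # Evict the oldest insertion (dicts preserve insertion order)
            self._store.pop(next(iter(self._store)))
        self._store[key] = (time.monotonic(), value)
```

Wrap `pipeline.search` with `get`/`put` keyed on `(query, top_k)` so repeated queries within the TTL window cost nothing.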

Skip unnecessary steps

Don’t rerank if initial retrieval quality is high. Don’t generate if users just need documents.

Batch when possible

Batch embedding reduces API overhead by 70%. Use for background indexing and bulk queries.

Monitor and optimize

Track cost per query. Identify expensive operations and optimize hot paths.

Choose cost-effective LLMs

Groq-hosted Llama 3.3 is roughly 30x cheaper than GPT-4, with comparable quality for most RAG tasks.
