Optimize Memori for speed, efficiency, and cost-effectiveness
Memori is designed for high performance out of the box, but you can optimize it further for your specific workload. This guide covers configuration, best practices, and advanced techniques.
`recall_embeddings_limit` (default: `1000`) controls the maximum number of embeddings to search during memory recall.
- **Higher values:** more comprehensive search, potentially better recall, slower queries
- **Lower values:** faster queries, lower quota usage, potentially less accurate recall
```python
# For high-throughput applications
mem.config.recall_embeddings_limit = 500

# For maximum accuracy
mem.config.recall_embeddings_limit = 5000
```
Recommended: Start with 1000. Reduce to 500 if you need faster queries and have well-organized entity/process attribution. Increase to 2000+ only if recall accuracy is insufficient.
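To build intuition for why this limit matters, here is a Memori-independent sketch of a brute-force top-k similarity search. The scan cost grows linearly with the number of embeddings examined, which is exactly what the recall limit caps (the function and data below are illustrative, not Memori internals):

```python
import heapq
import math
import random

def top_k(query, embeddings, k=5, limit=1000):
    """Brute-force cosine-similarity search over at most `limit` vectors."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # The recall limit caps how many stored vectors are scanned
    candidates = embeddings[:limit]
    scores = [(cosine(query, e), i) for i, e in enumerate(candidates)]
    return heapq.nlargest(k, scores)

random.seed(0)
dims = 384
store = [[random.random() for _ in range(dims)] for _ in range(2000)]
query = [random.random() for _ in range(dims)]

# With limit=500, only a quarter of the 2000-vector store is scanned
hits = top_k(query, store, k=5, limit=500)
```

Halving the limit halves the similarity computations per recall, at the risk of missing relevant memories that fall outside the scanned set.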
Memori uses sentence-transformers models for local embedding generation. Choose based on your speed vs. accuracy requirements:
| Model | Dimensions | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ⭐️⭐️⭐️ | ⭐️⭐️ | Default, fast, good quality |
| all-mpnet-base-v2 | 768 | ⭐️⭐️ | ⭐️⭐️⭐️ | Higher accuracy, slower |
| all-MiniLM-L12-v2 | 384 | ⭐️⭐️ | ⭐️⭐️ | Balanced option |
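The dimension column translates directly into storage and compute cost. As a rough, Memori-independent sketch (assuming the common float32 representation, 4 bytes per dimension; Memori's actual storage format may differ):

```python
# Back-of-the-envelope storage cost for the models above.
# Assumes float32 (4 bytes per dimension), the usual
# sentence-transformers default; actual storage may differ.

def embedding_bytes(dimensions: int, dtype_bytes: int = 4) -> int:
    """Raw size of one embedding vector in bytes."""
    return dimensions * dtype_bytes

minilm = embedding_bytes(384)  # all-MiniLM-L6-v2 / all-MiniLM-L12-v2
mpnet = embedding_bytes(768)   # all-mpnet-base-v2

print(f"384-dim model: {minilm} bytes per embedding")   # 1536
print(f"768-dim model: {mpnet} bytes per embedding")    # 3072
print(f"100k memories at 768 dims: {mpnet * 100_000 / 1e6:.0f} MB")  # 307 MB
```

Doubling the dimensions doubles both the per-embedding storage and the per-comparison compute during recall, which is why the 768-dim model trades speed for accuracy.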
Configure via environment variable:
```bash
# Fast and efficient (default)
export MEMORI_EMBEDDINGS_MODEL="all-MiniLM-L6-v2"

# Best accuracy
export MEMORI_EMBEDDINGS_MODEL="all-mpnet-base-v2"

# Install the model
python -m memori setup
```
Or programmatically:
```python
from memori import Memori

mem = Memori()
mem.config.embeddings.model = "all-mpnet-base-v2"
```
Changing embedding models requires recomputing all existing embeddings. This is handled automatically but may cause a temporary performance impact. Stick with one model in production.
Recommended database indexes for Memori's common query patterns:

```sql
-- Entity and process lookups
CREATE INDEX idx_entity_id ON memories(entity_id);
CREATE INDEX idx_process_id ON memories(process_id);
CREATE INDEX idx_session_id ON memories(session_id);

-- Composite index for common queries
CREATE INDEX idx_entity_process ON memories(entity_id, process_id);

-- Timestamp-based queries
CREATE INDEX idx_created_at ON memories(created_at DESC);
```
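To sanity-check that queries actually use these indexes, inspect the query plan. A minimal sketch using Python's built-in `sqlite3` and a hypothetical `memories` schema mirroring the indexed columns (the real Memori schema may differ):

```python
import sqlite3

# Hypothetical schema with the columns indexed above;
# the real Memori schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE memories (
        id INTEGER PRIMARY KEY,
        entity_id TEXT,
        process_id TEXT,
        session_id TEXT,
        created_at TEXT
    )
""")
conn.execute(
    "CREATE INDEX idx_entity_process ON memories(entity_id, process_id)"
)

# EXPLAIN QUERY PLAN shows whether the composite index is used
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM memories "
    "WHERE entity_id = ? AND process_id = ?",
    ("user_123", "agent"),
).fetchall()
print(plan[0][3])  # e.g. "SEARCH memories USING INDEX idx_entity_process ..."
```

If the plan reports a full table scan instead of an index search, the index is missing or the query shape doesn't match it.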
For streaming responses, Memori processes chunks efficiently:
```python
from memori import Memori
from openai import OpenAI

mem = Memori()
client = mem.llm.register(OpenAI())
mem.attribution(entity_id="user_123", process_id="streaming_agent")

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)

# Stream to the user immediately; Memori records in the background
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Reuse a session when processing multiple messages on the same topic, and start a new one when the topic changes:

```python
from memori import Memori
from openai import OpenAI

mem = Memori()
client = mem.llm.register(OpenAI())
mem.attribution(entity_id="user_123", process_id="agent")

# Get current session ID
session_id = mem.config.session_id

# Process multiple messages in the same session
for message in user_messages:
    mem.set_session(session_id)  # Reuse session
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": message}],
    )

# Start a new session for a new topic
mem.new_session()
```
For bulk workloads, process items in parallel batches with the async client:

```python
import asyncio

from memori import Memori
from openai import AsyncOpenAI

async def process_batch(items, batch_size=10):
    mem = Memori()
    client = mem.llm.register(AsyncOpenAI())

    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]

        # Process the batch in parallel
        tasks = [
            client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": item}],
            )
            for item in batch
        ]
        results.extend(await asyncio.gather(*tasks))

        # Small delay between batches
        await asyncio.sleep(0.1)

    return results

# Process 100 items in batches of 10
items = [f"Item {i}" for i in range(100)]
results = asyncio.run(process_batch(items, batch_size=10))
```