Semantic recall enables agents to retrieve contextually relevant messages from conversation history using vector embeddings and similarity search. This provides long-term memory beyond recent message limits.

How It Works

The SemanticRecall processor operates as both an input and output processor:
  1. On Input: Performs semantic search on historical messages and adds relevant context
  2. On Output: Creates embeddings for new messages to enable future semantic search
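The retrieval step boils down to embedding the query and ranking stored message embeddings by similarity. A minimal sketch of that ranking using plain cosine similarity (illustrative only, not Mastra's internals; the types and function names here are hypothetical):

```typescript
type StoredMessage = { id: string; text: string; embedding: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored messages against a query embedding and keep the topK.
function recall(query: number[], store: StoredMessage[], topK: number): StoredMessage[] {
  return [...store]
    .sort((a, b) => cosine(query, b.embedding) - cosine(query, a.embedding))
    .slice(0, topK);
}
```

In production this brute-force scan is replaced by the vector store's approximate index, but the ranking semantics are the same.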

Basic Configuration

Enable semantic recall with vector storage and an embedder:
import { Memory } from '@mastra/core';
import { PgVector } from '@mastra/vector-pg';
import { LibSQLStore } from '@mastra/store-libsql';

const memory = new Memory({
  storage: new LibSQLStore({
    id: 'agent-memory',
    url: 'file:./memory.db'
  }),
  vector: new PgVector({
    connectionString: process.env.DATABASE_URL
  }),
  embedder: 'openai/text-embedding-3-small',
  options: {
    lastMessages: 10,
    semanticRecall: {
      topK: 5,
      messageRange: 2,
      scope: 'resource'
    }
  }
});

Configuration Options

  • semanticRecall (boolean | SemanticRecall): Enable semantic recall with defaults (true), or pass an object to configure the options below.
  • topK (number, default: 4): Number of most similar messages to retrieve from the vector database.
  • messageRange (number | { before: number; after: number }, default: 1): Amount of surrounding context to include with each retrieved message.
  • scope ('thread' | 'resource', default: 'resource'): Scope of the semantic search. thread searches only within the current conversation thread; resource searches across all threads owned by the user/resource.
  • threshold (number): Minimum similarity score (0-1). Messages below this threshold are filtered out.
  • indexConfig (VectorIndexConfig): Vector index configuration (PostgreSQL-specific). See Index Optimization below.
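topK and threshold interact: results are ranked by similarity, anything below the threshold is dropped, and at most topK survivors are kept. A small sketch of that selection logic (the result shape and function name are hypothetical, not Mastra's query API):

```typescript
type QueryResult = { id: string; score: number };

// Drop results below the similarity floor, then keep the topK best.
function applyRecallOptions(
  results: QueryResult[],
  topK: number,
  threshold?: number
): QueryResult[] {
  const filtered = threshold === undefined
    ? results
    : results.filter(r => r.score >= threshold);
  return filtered
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Note that with a threshold set, fewer than topK messages may be returned.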

Configuration Examples

Simple Setup

const memory = new Memory({
  storage,
  vector,
  embedder: 'openai/text-embedding-3-small',
  options: {
    semanticRecall: true // Enable with defaults
  }
});

Advanced Configuration

const memory = new Memory({
  storage,
  vector,
  embedder: 'openai/text-embedding-3-large',
  embedderOptions: {
    providerOptions: {
      openai: {
        dimensions: 1536 // Custom embedding dimensions
      }
    }
  },
  options: {
    lastMessages: 10,
    semanticRecall: {
      topK: 8,
      messageRange: { before: 2, after: 3 },
      scope: 'resource',
      threshold: 0.7,
      indexConfig: {
        type: 'hnsw',
        metric: 'dotproduct',
        hnsw: {
          m: 16,
          efConstruction: 64
        }
      }
    }
  }
});

Thread-Scoped Recall

const memory = new Memory({
  storage,
  vector,
  embedder: 'openai/text-embedding-3-small',
  options: {
    semanticRecall: {
      topK: 5,
      scope: 'thread' // Only search current thread
    }
  }
});

Vector Store Setup

Semantic recall requires a vector database. Mastra supports multiple providers:
import { PgVector } from '@mastra/vector-pg';

const vector = new PgVector({
  connectionString: process.env.DATABASE_URL
});

Embedder Configuration

Choose an embedding model compatible with your use case:
const memory = new Memory({
  storage,
  vector,
  embedder: 'openai/text-embedding-3-small',
  embedderOptions: {
    providerOptions: {
      openai: {
        dimensions: 1536
      }
    }
  },
  options: {
    semanticRecall: true
  }
});

Index Optimization

For PostgreSQL with pgvector, you can optimize semantic recall performance with index configuration:
const memory = new Memory({
  storage,
  vector,
  embedder: 'openai/text-embedding-3-small',
  options: {
    semanticRecall: {
      topK: 5,
      indexConfig: {
        type: 'hnsw', // Hierarchical Navigable Small World
        metric: 'dotproduct', // Best for OpenAI embeddings
        hnsw: {
          m: 16, // Links per node
          efConstruction: 64 // Construction quality
        }
      }
    }
  }
});
Index Types:
  • hnsw: Best performance for most cases (recommended)
  • ivfflat: Good balance of speed and recall
  • flat: Exact nearest neighbor (slow but 100% recall)

Cross-Thread Recall

When using scope: 'resource', semantic recall can retrieve messages from other threads:
const memory = new Memory({
  storage,
  vector,
  embedder: 'openai/text-embedding-3-small',
  options: {
    semanticRecall: {
      topK: 5,
      messageRange: 2,
      scope: 'resource' // Search across all user threads
    }
  }
});

const agent = new Agent({
  name: 'Assistant',
  model: 'openai/gpt-4o',
  memory
});

// Query references information from previous conversations
const result = await agent.generate(
  'What did I say about my dietary preferences?',
  {
    threadId: 'current-thread',
    resourceId: 'user-123'
  }
);
Cross-thread messages are formatted with timestamps:
The following messages were remembered from a different conversation:
<remembered_from_other_conversation>

the following messages are from 2024, Feb, 15
Message from previous conversation at 3:45 PM: User: I'm allergic to peanuts
Message from previous conversation at 3:46 PM: Assistant: I'll make sure to avoid peanuts in all recommendations

<end_remembered_from_other_conversation>
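A sketch of how such a block could be assembled from recalled messages (illustrative only; the type and function name are hypothetical, and Mastra produces this template internally):

```typescript
type RecalledMessage = { role: 'User' | 'Assistant'; text: string; createdAt: Date };

// Wrap recalled messages in marker tags, grouped under a date line,
// with a per-message timestamp.
function formatRemembered(messages: RecalledMessage[]): string {
  const lines = messages.map(m => {
    const time = m.createdAt.toLocaleTimeString('en-US', {
      hour: 'numeric',
      minute: '2-digit',
    });
    return `Message from previous conversation at ${time}: ${m.role}: ${m.text}`;
  });
  const d = messages[0].createdAt;
  const month = d.toLocaleString('en-US', { month: 'short' });
  const dateLine = `the following messages are from ${d.getFullYear()}, ${month}, ${d.getDate()}`;
  return [
    'The following messages were remembered from a different conversation:',
    '<remembered_from_other_conversation>',
    '',
    dateLine,
    ...lines,
    '',
    '<end_remembered_from_other_conversation>',
  ].join('\n');
}
```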

Embedding Cache

SemanticRecall uses a global embedding cache to avoid redundant API calls:
import { globalEmbeddingCache } from '@mastra/core/processors';

// Clear cache if needed
globalEmbeddingCache.clear();

// Check cache size
console.log(`Cache size: ${globalEmbeddingCache.size}`);
The cache uses xxhash for fast key generation and includes the index name to ensure isolation between different embedding models/dimensions.

Implementation Details

The SemanticRecall processor handles semantic search and embedding creation:
async processInput(args) {
  // Simplified excerpt: indexName, threadId, and resourceId are derived
  // from the processor's configuration and the request context
  const { messages, messageList, requestContext } = args;
  
  // Extract user query from last user message
  const userQuery = this.extractUserQuery(messages);
  if (!userQuery) return messageList;
  
  // Generate embeddings for the query
  const { embeddings, dimension } = await this.embedMessageContent(
    userQuery,
    indexName
  );
  
  // Ensure vector index exists
  await this.ensureVectorIndex(indexName, dimension);
  
  // Perform vector search
  const results = await this.vector.query({
    indexName,
    queryVector: embeddings[0],
    topK: this.topK,
    filter: this.scope === 'resource' 
      ? { resource_id: resourceId } 
      : { thread_id: threadId }
  });
  
  // Retrieve messages with context
  const similarMessages = await this.storage.listMessages({
    threadId,
    resourceId,
    include: results.map(r => ({
      id: r.metadata?.message_id,
      threadId: r.metadata?.thread_id,
      withNextMessages: this.messageRange.after,
      withPreviousMessages: this.messageRange.before
    }))
  });
  
  // Add to message list
  messageList.add(similarMessages, 'memory');
  return messageList;
}

Best Practices

Choose the Right Scope

Use the resource scope for cross-conversation context and the thread scope for session-specific recall.

Tune TopK

Start with 3-5 similar messages. More results increase context but also token usage.

Set a Threshold

Filter low-quality matches with a similarity threshold (e.g., 0.7).

Optimize Indexes

Use HNSW indexes for PostgreSQL to improve query performance.

Troubleshooting

If recall returns no or few results:
  • Check that embeddings were created (verify the vector store has data)
  • Lower the threshold value if one is set
  • Ensure scope matches your use case (thread vs resource)
  • Verify that embedder dimensions match the vector store index

If queries are slow:
  • Use the HNSW index type for PostgreSQL
  • Reduce the topK value
  • Check vector store connection and query performance
  • Consider using a smaller embedding model

If recall consumes too many tokens:
  • Reduce topK (fewer messages retrieved)
  • Reduce messageRange (less surrounding context)
  • Increase threshold (only highly relevant matches)
  • Balance with lastMessages to avoid redundancy

Next Steps

Working Memory

Store structured user information across conversations

RAG Overview

Learn about document-based RAG in Mastra

Conversation History

Manage recent message persistence
