Semantic recall enables agents to retrieve contextually relevant messages from conversation history using vector embeddings and similarity search. This provides long-term memory beyond recent message limits.

How It Works

The SemanticRecall processor operates as both an input and output processor:
  1. On Input: Performs semantic search on historical messages and adds relevant context
  2. On Output: Creates embeddings for new messages to enable future semantic search
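The retrieval step boils down to embedding the query and ranking stored message embeddings by similarity. A minimal sketch of that ranking using plain cosine similarity (illustrative only, not Mastra's internals; the types and function names here are hypothetical):

```typescript
type StoredMessage = { id: string; text: string; embedding: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored messages against a query embedding and keep the topK.
function recall(query: number[], store: StoredMessage[], topK: number): StoredMessage[] {
  return [...store]
    .sort((a, b) => cosine(query, b.embedding) - cosine(query, a.embedding))
    .slice(0, topK);
}
```

In production this brute-force scan is replaced by the vector store's approximate index, but the ranking semantics are the same.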

Basic Configuration

Enable semantic recall with vector storage and an embedder:
import { Memory } from '@mastra/core';
import { PgVector } from '@mastra/vector-pg';
import { LibSQLStore } from '@mastra/store-libsql';

const memory = new Memory({
  storage: new LibSQLStore({
    id: 'agent-memory',
    url: 'file:./memory.db'
  }),
  vector: new PgVector({
    connectionString: process.env.DATABASE_URL
  }),
  embedder: 'openai/text-embedding-3-small',
  options: {
    lastMessages: 10,
    semanticRecall: {
      topK: 5,
      messageRange: 2,
      scope: 'resource'
    }
  }
});

Configuration Options

  • semanticRecall (boolean | SemanticRecall): Enable semantic recall with defaults (true), or pass an object to configure the options below.
  • topK (number, default: 4): Number of most similar messages to retrieve from the vector database.
  • messageRange (number | { before: number; after: number }, default: 1): Amount of surrounding context to include with each retrieved message.
  • scope ('thread' | 'resource', default: 'resource'): Scope of the semantic search. thread searches only within the current conversation thread; resource searches across all threads owned by the user/resource.
  • threshold (number): Minimum similarity score (0-1). Messages below this threshold are filtered out.
  • indexConfig (VectorIndexConfig): Vector index configuration (PostgreSQL-specific). See Index Optimization below.
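topK and threshold interact: results are ranked by similarity, anything below the threshold is dropped, and at most topK survivors are kept. A small sketch of that selection logic (the result shape and function name are hypothetical, not Mastra's query API):

```typescript
type QueryResult = { id: string; score: number };

// Drop results below the similarity floor, then keep the topK best.
function applyRecallOptions(
  results: QueryResult[],
  topK: number,
  threshold?: number
): QueryResult[] {
  const filtered = threshold === undefined
    ? results
    : results.filter(r => r.score >= threshold);
  return filtered
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Note that with a threshold set, fewer than topK messages may be returned.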

Configuration Examples

Simple Setup

const memory = new Memory({
  storage,
  vector,
  embedder: 'openai/text-embedding-3-small',
  options: {
    semanticRecall: true // Enable with defaults
  }
});

Advanced Configuration

const memory = new Memory({
  storage,
  vector,
  embedder: 'openai/text-embedding-3-large',
  embedderOptions: {
    providerOptions: {
      openai: {
        dimensions: 1536 // Custom embedding dimensions
      }
    }
  },
  options: {
    lastMessages: 10,
    semanticRecall: {
      topK: 8,
      messageRange: { before: 2, after: 3 },
      scope: 'resource',
      threshold: 0.7,
      indexConfig: {
        type: 'hnsw',
        metric: 'dotproduct',
        hnsw: {
          m: 16,
          efConstruction: 64
        }
      }
    }
  }
});

Thread-Scoped Recall

const memory = new Memory({
  storage,
  vector,
  embedder: 'openai/text-embedding-3-small',
  options: {
    semanticRecall: {
      topK: 5,
      scope: 'thread' // Only search current thread
    }
  }
});

Vector Store Setup

Semantic recall requires a vector database. Mastra supports multiple providers:
import { PgVector } from '@mastra/vector-pg';

const vector = new PgVector({
  connectionString: process.env.DATABASE_URL
});

Embedder Configuration

Choose an embedding model compatible with your use case:
const memory = new Memory({
  storage,
  vector,
  embedder: 'openai/text-embedding-3-small',
  embedderOptions: {
    providerOptions: {
      openai: {
        dimensions: 1536
      }
    }
  },
  options: {
    semanticRecall: true
  }
});

Index Optimization

For PostgreSQL with pgvector, you can optimize semantic recall performance with index configuration:
const memory = new Memory({
  storage,
  vector,
  embedder: 'openai/text-embedding-3-small',
  options: {
    semanticRecall: {
      topK: 5,
      indexConfig: {
        type: 'hnsw', // Hierarchical Navigable Small World
        metric: 'dotproduct', // Best for OpenAI embeddings
        hnsw: {
          m: 16, // Links per node
          efConstruction: 64 // Construction quality
        }
      }
    }
  }
});
Index Types:
  • hnsw: Best performance for most cases (recommended)
  • ivfflat: Good balance of speed and recall
  • flat: Exact nearest neighbor (slow but 100% recall)

Cross-Thread Recall

When using scope: 'resource', semantic recall can retrieve messages from other threads:
const memory = new Memory({
  storage,
  vector,
  embedder: 'openai/text-embedding-3-small',
  options: {
    semanticRecall: {
      topK: 5,
      messageRange: 2,
      scope: 'resource' // Search across all user threads
    }
  }
});

const agent = new Agent({
  name: 'Assistant',
  model: 'openai/gpt-4o',
  memory
});

// Query references information from previous conversations
const result = await agent.generate(
  'What did I say about my dietary preferences?',
  {
    threadId: 'current-thread',
    resourceId: 'user-123'
  }
);
Cross-thread messages are formatted with timestamps:
The following messages were remembered from a different conversation:
<remembered_from_other_conversation>

the following messages are from 2024, Feb, 15
Message from previous conversation at 3:45 PM: User: I'm allergic to peanuts
Message from previous conversation at 3:46 PM: Assistant: I'll make sure to avoid peanuts in all recommendations

<end_remembered_from_other_conversation>
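A sketch of how such a block could be assembled from recalled messages (illustrative only; the type and function name are hypothetical, and Mastra produces this template internally):

```typescript
type RecalledMessage = { role: 'User' | 'Assistant'; text: string; createdAt: Date };

// Wrap recalled messages in marker tags, grouped under a date line,
// with a per-message timestamp.
function formatRemembered(messages: RecalledMessage[]): string {
  const lines = messages.map(m => {
    const time = m.createdAt.toLocaleTimeString('en-US', {
      hour: 'numeric',
      minute: '2-digit',
    });
    return `Message from previous conversation at ${time}: ${m.role}: ${m.text}`;
  });
  const d = messages[0].createdAt;
  const month = d.toLocaleString('en-US', { month: 'short' });
  const dateLine = `the following messages are from ${d.getFullYear()}, ${month}, ${d.getDate()}`;
  return [
    'The following messages were remembered from a different conversation:',
    '<remembered_from_other_conversation>',
    '',
    dateLine,
    ...lines,
    '',
    '<end_remembered_from_other_conversation>',
  ].join('\n');
}
```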

Embedding Cache

SemanticRecall uses a global embedding cache to avoid redundant API calls:
import { globalEmbeddingCache } from '@mastra/core/processors';

// Clear cache if needed
globalEmbeddingCache.clear();

// Check cache size
console.log(`Cache size: ${globalEmbeddingCache.size}`);
The cache uses xxhash for fast key generation and includes the index name to ensure isolation between different embedding models/dimensions.

Implementation Details

The SemanticRecall processor handles semantic search and embedding creation:
async processInput(args) {
  // Simplified excerpt: indexName, threadId, and resourceId are derived
  // from the processor's configuration and the request context
  const { messages, messageList, requestContext } = args;
  
  // Extract user query from last user message
  const userQuery = this.extractUserQuery(messages);
  if (!userQuery) return messageList;
  
  // Generate embeddings for the query
  const { embeddings, dimension } = await this.embedMessageContent(
    userQuery,
    indexName
  );
  
  // Ensure vector index exists
  await this.ensureVectorIndex(indexName, dimension);
  
  // Perform vector search
  const results = await this.vector.query({
    indexName,
    queryVector: embeddings[0],
    topK: this.topK,
    filter: this.scope === 'resource' 
      ? { resource_id: resourceId } 
      : { thread_id: threadId }
  });
  
  // Retrieve messages with context
  const similarMessages = await this.storage.listMessages({
    threadId,
    resourceId,
    include: results.map(r => ({
      id: r.metadata?.message_id,
      threadId: r.metadata?.thread_id,
      withNextMessages: this.messageRange.after,
      withPreviousMessages: this.messageRange.before
    }))
  });
  
  // Add to message list
  messageList.add(similarMessages, 'memory');
  return messageList;
}

Best Practices

Choose the Right Scope

Use the resource scope for cross-conversation context and the thread scope for session-specific recall.

Tune TopK

Start with 3-5 similar messages. More results increase context but also token usage.

Set a Threshold

Filter low-quality matches with a similarity threshold (e.g., 0.7).

Optimize Indexes

Use HNSW indexes for PostgreSQL to improve query performance.

Troubleshooting

If recall returns no or few results:
  • Check that embeddings were created (verify the vector store has data)
  • Lower the threshold value if one is set
  • Ensure scope matches your use case (thread vs resource)
  • Verify that embedder dimensions match the vector store index

If queries are slow:
  • Use the HNSW index type for PostgreSQL
  • Reduce the topK value
  • Check vector store connection and query performance
  • Consider using a smaller embedding model

If recall consumes too many tokens:
  • Reduce topK (fewer messages retrieved)
  • Reduce messageRange (less surrounding context)
  • Increase threshold (only highly relevant matches)
  • Balance with lastMessages to avoid redundancy

Next Steps

Working Memory

Store structured user information across conversations

RAG Overview

Learn about document-based RAG in Mastra

Conversation History

Manage recent message persistence
