Sampling lets an MCP Server ask the MCP Client to call an LLM on its behalf. This is useful when your server needs AI-generated content (like a summary or analysis) but shouldn’t — or can’t — call an LLM directly. The client, which already has access to an LLM, handles the request and returns the result.
## When to use sampling
A concrete example: a blog post creation tool that also needs a generated abstract. The server has all the content, but the LLM lives on the client side.
```
User → MCP Client: "Author blog post"
        ↓
MCP Client → MCP Server: Tool call (create_blog)
        ↓
MCP Server → MCP Client: sampling/createMessage (create summary)
        ↓
MCP Client → LLM: Generate abstract
        ↓
LLM → MCP Client: Abstract text
        ↓
MCP Client → MCP Server: Sampling response (abstract)
        ↓
MCP Server → MCP Client: Complete blog post (draft + abstract)
        ↓
MCP Client → User: Blog post ready
```
## The sampling request
The server sends a sampling/createMessage JSON-RPC request to the client:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "sampling/createMessage",
  "params": {
    "messages": [
      {
        "role": "user",
        "content": {
          "type": "text",
          "text": "Create a blog post summary of the following blog post: <BLOG POST>"
        }
      }
    ],
    "modelPreferences": {
      "hints": [
        { "name": "claude-3-sonnet" }
      ],
      "intelligencePriority": 0.8,
      "speedPriority": 0.5
    },
    "systemPrompt": "You are a helpful assistant.",
    "maxTokens": 100
  }
}
```
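If it helps to see the request assembled in code, here is a minimal sketch in plain Python. The dict keys follow the JSON above; the helper name `build_sampling_request` and the placeholder text are illustrative, not part of any SDK:

```python
import json


def build_sampling_request(request_id: int, blog_post_text: str) -> dict:
    """Assemble a sampling/createMessage JSON-RPC request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "sampling/createMessage",
        "params": {
            "messages": [
                {
                    "role": "user",
                    "content": {
                        "type": "text",
                        "text": f"Create a blog post summary of the following blog post: {blog_post_text}",
                    },
                }
            ],
            "modelPreferences": {
                "hints": [{"name": "claude-3-sonnet"}],
                "intelligencePriority": 0.8,
                "speedPriority": 0.5,
            },
            "systemPrompt": "You are a helpful assistant.",
            "maxTokens": 100,
        },
    }


request = build_sampling_request(1, "<BLOG POST>")
print(json.dumps(request, indent=2))
```

In practice the SDK builds this envelope for you (see the Python example below); the sketch just makes the wire format concrete.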
### Key fields
| Field | Description |
|---|---|
| `messages` | The conversation messages to send to the LLM |
| `modelPreferences.hints` | Preferred models (the client may use a different one) |
| `modelPreferences.intelligencePriority` | 0–1 scale; higher = prefer a smarter model |
| `modelPreferences.speedPriority` | 0–1 scale; higher = prefer a faster model |
| `systemPrompt` | System instruction for the LLM |
| `maxTokens` | Recommended token limit for the response |
Model preferences are recommendations only. The user (via the client) can choose a different model. Your server code must handle responses from any model.
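Since hints are advisory, the client needs some policy for resolving them. The MCP specification suggests treating hints as substrings of model names; the sketch below shows one such policy. The fallback heuristic and the model names are made up for illustration:

```python
def select_model(hints: list[str], available: list[str],
                 intelligence: float = 0.5, speed: float = 0.5) -> str:
    """Pick a model for a sampling request (illustrative client-side policy).

    Hints are tried in order and matched as substrings of available model
    names. Priorities only break the tie when no hint matches; this fallback
    is a made-up heuristic, not behavior mandated by the spec.
    """
    for hint in hints:
        for model in available:
            if hint in model:
                return model
    # No hint matched: assume `available` is ordered fast -> smart.
    return available[-1] if intelligence >= speed else available[0]


available = ["claude-3-haiku", "claude-3-sonnet", "gpt-5"]
print(select_model(["claude-3-sonnet"], available))   # hint matches
print(select_model(["gemini"], available, 0.8, 0.5))  # falls back on priorities
```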
## The sampling response
After the client calls the LLM, it sends the result back to the server:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "role": "assistant",
    "content": {
      "type": "text",
      "text": "Here's your abstract: <ABSTRACT>"
    },
    "model": "gpt-5",
    "stopReason": "endTurn"
  }
}
```
Note: the model in the response may differ from what you requested — the user chose gpt-5 instead of claude-3-sonnet.
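Because of this, the server should handle the response defensively. A small sketch, using plain dicts in place of the SDK's typed objects (`extract_abstract` is a hypothetical helper, not an SDK function):

```python
def extract_abstract(response: dict) -> tuple[str, str]:
    """Return (model, text) from a sampling response, whatever model ran."""
    result = response["result"]
    content = result["content"]
    # The content may be text, image, or audio; only text is useful here.
    if content.get("type") != "text":
        raise ValueError(f"unexpected content type: {content.get('type')!r}")
    return result["model"], content["text"]


response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "role": "assistant",
        "content": {"type": "text", "text": "Here's your abstract: <ABSTRACT>"},
        "model": "gpt-5",
        "stopReason": "endTurn",
    },
}
model, text = extract_abstract(response)
print(model, "->", text)
```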
## Message content types
Sampling messages support text, images, and audio:
```json
{
  "type": "text",
  "text": "The message content"
}
```
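Image and audio content carry base64-encoded data plus a MIME type. Per the MCP specification they look like this (the data strings are placeholders):

```json
{
  "type": "image",
  "data": "<base64-encoded image data>",
  "mimeType": "image/jpeg"
}
```

```json
{
  "type": "audio",
  "data": "<base64-encoded audio data>",
  "mimeType": "audio/wav"
}
```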
## Implementing a sampling server (Python)
Here’s a complete blog post tool that uses sampling to generate an abstract:
```python
from mcp.server.fastmcp import Context, FastMCP
from mcp.server.session import ServerSession
from mcp.types import SamplingMessage, TextContent
from pydantic import BaseModel
import json

mcp = FastMCP("Blog post generator")

# In-memory store for created posts (demo only; not persistent).
posts = []


class BlogPost(BaseModel):
    id: int
    title: str
    content: str
    abstract: str = ""


@mcp.tool()
async def create_blog(
    title: str,
    content: str,
    ctx: Context[ServerSession, None],
) -> str:
    """Create a blog post and generate a summary using sampling."""
    # Step 1: Create the blog post object
    post = BlogPost(
        id=len(posts) + 1,
        title=title,
        content=content,
        abstract="",
    )

    # Step 2: Send a sampling request to the client
    prompt = (
        f"Create an abstract of the following blog post: "
        f"title: {title} and draft: {content}"
    )
    result = await ctx.session.create_message(
        messages=[
            SamplingMessage(
                role="user",
                content=TextContent(type="text", text=prompt),
            )
        ],
        max_tokens=100,
    )

    # Step 3: Use the LLM response as the abstract. The client may return
    # non-text content, so check the type before reading `.text`.
    if isinstance(result.content, TextContent):
        post.abstract = result.content.text
    posts.append(post)

    # Step 4: Return the complete post
    return json.dumps({
        "id": post.id,
        "title": post.title,
        "abstract": post.abstract,
    })
```
## Enabling sampling in the client
If you are also building the client (not just the server), declare sampling support in client capabilities:
```json
{
  "capabilities": {
    "sampling": {}
  }
}
```
If you are only building the MCP Server, you don’t need to configure anything on the client side — the host application (Claude Desktop, VS Code, etc.) handles sampling responses automatically.
## Key takeaways

- Sampling lets a server delegate LLM calls to the client: the server sends a `sampling/createMessage` request, and the client calls the LLM and returns the result.
- Model preferences are recommendations; the client and user choose the actual model used.
- Sampling messages support text, image, and audio content types.
- The server uses `ctx.session.create_message()` (Python) to issue sampling requests from within a tool.
- This pattern is only available with the low-level server API or via the `Context` object in FastMCP.