LangSmith provides a pytest integration for tracing test cases, caching LLM responses, and making assertions with the expect API.

test decorator

The @test decorator (or @pytest.mark.langsmith) traces pytest tests to LangSmith.
import pytest
from langsmith import test

@test  # or @pytest.mark.langsmith
def test_my_function():
    result = my_function("input")
    assert result == "expected"

Parameters

id (UUID | None): Unique identifier for the test case. Auto-generated based on the test name if not provided.
output_keys (Sequence[str] | None): Keys from the test function's inputs to extract as expected outputs.
@test(output_keys=["expected"])
def test_function(input_data, expected):
    result = process(input_data)
    assert result == expected
# 'expected' is stored as ground truth in LangSmith
client (Client | None): Custom LangSmith client to use for this test.
test_suite_name (str | None): Name of the test suite. Defaults to the package name or environment variable.
@test(test_suite_name="integration-tests")
def test_integration():
    pass
metadata (dict | None): Additional metadata to attach to the test case.
@test(metadata={"category": "unit", "priority": "high"})
def test_critical_path():
    pass
repetitions (int | None): Number of times to repeat the test. Useful for measuring variance in nondeterministic outputs.
@test(repetitions=5)
def test_with_variance():
    # Test runs 5 times to measure consistency
    result = llm_call()
    assert "expected" in result
cached_hosts (Sequence[str] | None): List of hosts whose requests are cached. Only applies when LANGSMITH_TEST_CACHE is set.
@test(cached_hosts=["api.openai.com"])
def test_openai_only_cached():
    # Only OpenAI calls are cached, others are live
    pass

Request caching

Cache LLM API calls to speed up tests and reduce costs:
# Enable caching
export LANGSMITH_TEST_CACHE=tests/cassettes
import pytest
import openai
from langsmith import wrappers

# Wrap the OpenAI client
oai_client = wrappers.wrap_openai(openai.Client())

@pytest.mark.langsmith
def test_openai_completion():
    response = oai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Say hello"}]
    )
    assert "hello" in response.choices[0].message.content.lower()
    # Response is cached for subsequent runs

Installation

Caching requires the vcr extra:
pip install "langsmith[vcr]"

How it works

  1. First run: Real API calls are made and responses cached to disk
  2. Subsequent runs: Cached responses are returned instantly
  3. Cache files are stored in the directory specified by LANGSMITH_TEST_CACHE
  4. Commit cache files to version control for consistent CI/CD
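
In practice, the steps above reduce to setting one environment variable before pytest starts; for example, in a conftest.py (the path below is illustrative):
```python
# conftest.py (illustrative): point the cache at a cassette directory.
# The first run records real API responses there; later runs replay them.
import os

os.environ.setdefault("LANGSMITH_TEST_CACHE", "tests/cassettes")
```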

Selective caching

Cache only specific hosts:
@test(cached_hosts=["api.openai.com", "api.anthropic.com"])
def test_multi_provider():
    # Only OpenAI and Anthropic calls are cached
    # Other API calls are made live
    openai_response = openai_client.chat.completions.create(...)
    anthropic_response = anthropic_client.messages.create(...)

expect API

Make assertions and log scores within test cases.
import pytest
from langsmith import expect

@pytest.mark.langsmith
def test_with_expectations():
    response = my_llm_function("input")
    
    # Assert with automatic feedback logging
    expect.value(response).to_contain("expected text")
    
    # Score without assertion
    expect.score(0.95, key="quality")
    
    # Embedding similarity
    expect.embedding_distance(
        prediction=response,
        reference="expected output"
    ).to_be_less_than(0.5)
    
    # Edit distance
    expect.edit_distance(
        prediction=response,
        reference="expected"
    ).to_be_less_than(5)

Methods

expect.value

Make assertions on any value.
expect.value(response).to_contain("hello")
expect.value(count).to_be_greater_than(5)
expect.value(result).to_equal(expected)
expect.value(result).against(lambda x: x > 0)
  • to_equal(value): Exact equality
  • to_contain(substring): String contains substring
  • to_be_greater_than(value): Numeric comparison
  • to_be_less_than(value): Numeric comparison
  • to_be_between(min, max): Range check
  • to_be_approximately(value, precision): Approximate equality
  • against(func): Custom assertion function

expect.score

Log a score without assertion.
expect.score(0.85, key="accuracy")
expect.score(1.23, key="latency_seconds")
score (float, required): Numeric score to log.
key (str): Metric name. Defaults to "score".
Returns a matcher for optional assertions:
expect.score(0.85, key="accuracy").to_be_greater_than(0.8)

expect.embedding_distance

Compute semantic similarity using embeddings.
expect.embedding_distance(
    prediction="The cat sat on the mat",
    reference="A cat was sitting on a mat",
    config={"provider": "openai", "model": "text-embedding-3-small"}
).to_be_less_than(0.3)
prediction (str, required): Predicted text.
reference (str, required): Reference text.
config (EmbeddingConfig | None): Configuration for the embedding provider.
config = {
    "provider": "openai",  # or "huggingface"
    "model": "text-embedding-3-small"
}
Returns a matcher with the computed distance:
matcher = expect.embedding_distance(pred, ref)
matcher.to_be_less_than(0.5)  # Assert similarity
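
The returned distance is a vector-space distance between the two texts' embeddings; the exact metric depends on the provider configuration, but cosine distance is a common choice. For intuition, a self-contained sketch of cosine distance (not langsmith code):
```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: ~0 for parallel vectors, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Parallel directions give distance ~0; orthogonal directions give 1.
assert abs(cosine_distance([1.0, 0.0], [2.0, 0.0])) < 1e-9
assert abs(cosine_distance([1.0, 0.0], [0.0, 1.0]) - 1.0) < 1e-9
```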

expect.edit_distance

Compute Levenshtein edit distance.
expect.edit_distance(
    prediction="hello world",
    reference="hello word",
    config={"normalize": True}
).to_be_less_than(2)
prediction (str, required): Predicted text.
reference (str, required): Reference text.
config (EditDistanceConfig | None): Configuration options:
config = {
    "normalize": True,  # Normalize by max length
}
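
Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A self-contained reference implementation, with normalization shown as division by the longer string's length (one common convention; the exact normalization used by langsmith may differ):
```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    # One common normalization: divide by the longer string's length.
    return levenshtein(a, b) / max(len(a), len(b), 1)

# "hello world" -> "hello word" is a single deletion.
assert levenshtein("hello world", "hello word") == 1
```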

unit decorator

Alias for @test that creates a unit test context.
from langsmith import unit

@unit
def test_helper_function():
    result = helper("input")
    assert result == "output"
Functionally identical to @test, but semantically indicates a unit test.

Complete example

import pytest
import openai
from langsmith import test, expect, wrappers

# Enable caching
# export LANGSMITH_TEST_CACHE=tests/cassettes

oai_client = wrappers.wrap_openai(openai.Client())

@test(
    test_suite_name="qa-pipeline",
    metadata={"category": "integration", "model": "gpt-3.5-turbo"},
    cached_hosts=["api.openai.com"]
)
def test_qa_accuracy():
    """Test QA pipeline with caching and expectations."""
    question = "What is the capital of France?"
    
    response = oai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer questions accurately."},
            {"role": "user", "content": question}
        ]
    )
    
    answer = response.choices[0].message.content
    
    # Assertions with automatic feedback logging
    expect.value(answer).to_contain("Paris")
    
    # Semantic similarity check
    expect.embedding_distance(
        prediction=answer,
        reference="The capital of France is Paris."
    ).to_be_less_than(0.3)
    
    # Log metrics
    expect.score(
        response.usage.total_tokens,
        key="token_count"
    ).to_be_less_than(100)

@test(repetitions=3)
def test_consistency():
    """Run multiple times to check consistency."""
    result = llm_generate("test")
    expect.value(result).to_contain("expected")

Environment variables

LANGSMITH_TEST_CACHE (str): Directory to store cached API responses. Caching is enabled when set.
export LANGSMITH_TEST_CACHE=tests/cassettes
LANGSMITH_TEST_TRACKING (str): Set to "false" to disable test tracking.
export LANGSMITH_TEST_TRACKING=false

Best practices

  1. Commit cache files: Check in cached responses for consistent CI/CD
  2. Use descriptive test names: They become dataset example names in LangSmith
  3. Add metadata: Helps organize and filter tests
  4. Use expect for metrics: Logs scores alongside test results
  5. Cache expensive calls: Speeds up tests and reduces API costs
