LangSmith provides a pytest integration for tracing test cases, caching LLM responses, and making assertions with the `expect` API.
## `test` decorator

The `@test` decorator (or `@pytest.mark.langsmith`) traces pytest tests to LangSmith.

```python
import pytest
from langsmith import test

@test  # or @pytest.mark.langsmith
def test_my_function():
    result = my_function("input")
    assert result == "expected"
```
### Parameters

**id**: Unique identifier for the test case. Auto-generated from the test name if not provided.

**output_keys**: Keys from the test function's inputs to extract as expected outputs.

```python
@test(output_keys=["expected"])
def test_function(input_data, expected):
    result = process(input_data)
    assert result == expected
    # 'expected' is stored as ground truth in LangSmith
```

**client**: Custom LangSmith client to use for this test.

**test_suite_name**: Name of the test suite. Defaults to the package name or the corresponding environment variable.

```python
@test(test_suite_name="integration-tests")
def test_integration():
    pass
```

**metadata**: Additional metadata to attach to the test case.

```python
@test(metadata={"category": "unit", "priority": "high"})
def test_critical_path():
    pass
```

**repetitions**: Number of times to repeat the test. Useful for measuring variance.

```python
@test(repetitions=5)
def test_with_variance():
    # Test runs 5 times to measure consistency
    result = llm_call()
    assert "expected" in result
```

**cached_hosts**: List of hosts whose requests are cached. Only applies when LANGSMITH_TEST_CACHE is set.

```python
@test(cached_hosts=["api.openai.com"])
def test_openai_only_cached():
    # Only OpenAI calls are cached; others are live
    pass
```
## Request caching

Cache LLM API calls to speed up tests and reduce costs:

```bash
# Enable caching
export LANGSMITH_TEST_CACHE=tests/cassettes
```
```python
import pytest
import openai
from langsmith import wrappers

# Wrap the OpenAI client
oai_client = wrappers.wrap_openai(openai.Client())

@pytest.mark.langsmith
def test_openai_completion():
    response = oai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Say hello"}]
    )
    assert "hello" in response.choices[0].message.content.lower()
    # Response is cached for subsequent runs
```
### Installation

Caching requires the `vcr` extra:

```bash
pip install "langsmith[vcr]"
```
### How it works

- **First run**: real API calls are made and responses are cached to disk.
- **Subsequent runs**: cached responses are returned instantly.
- Cache files are stored in the directory specified by LANGSMITH_TEST_CACHE.
- Commit cache files to version control for consistent CI/CD runs.
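This record/replay pattern is the same idea VCR-style cassettes use. A minimal sketch of the mechanism, not LangSmith's actual implementation (`cached_call` and `call_api` are illustrative names):

```python
import hashlib
import json
import os


def cached_call(cache_dir, request, call_api):
    """VCR-style record/replay: return a cached response if one exists
    for this request, otherwise call the API and record the response."""
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(path):
        # Subsequent run: replay the recorded response from disk
        with open(path) as f:
            return json.load(f)
    # First run: make the real API call and record it
    response = call_api(request)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(response, f)
    return response
```

Keying the cache on a hash of the full request is what makes replay deterministic: any change to the model, prompt, or parameters produces a new key and therefore a fresh recording.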
### Selective caching

Cache only specific hosts:

```python
@test(cached_hosts=["api.openai.com", "api.anthropic.com"])
def test_multi_provider():
    # Only OpenAI and Anthropic calls are cached;
    # other API calls are made live
    openai_response = openai_client.chat.completions.create(...)
    anthropic_response = anthropic_client.messages.create(...)
```
## `expect` API

Make assertions and log scores within test cases.

```python
import pytest
from langsmith import expect

@pytest.mark.langsmith
def test_with_expectations():
    response = my_llm_function("input")

    # Assert with automatic feedback logging
    expect.value(response).to_contain("expected text")

    # Score without assertion
    expect.score(0.95, key="quality")

    # Embedding similarity
    expect.embedding_distance(
        prediction=response,
        reference="expected output"
    ).to_be_less_than(0.5)

    # Edit distance
    expect.edit_distance(
        prediction=response,
        reference="expected"
    ).to_be_less_than(5)
```
### Methods

#### expect.value

Make assertions on any value.

```python
expect.value(response).to_contain("hello")
expect.value(count).to_be_greater_than(5)
expect.value(result).to_equal(expected)
expect.value(result).against(lambda x: x > 0)
```

- `to_equal(value)`: exact equality
- `to_contain(substring)`: string contains substring
- `to_be_greater_than(value)`: numeric comparison
- `to_be_less_than(value)`: numeric comparison
- `to_be_between(min, max)`: range check
- `to_be_approximately(value, precision)`: approximate equality
- `against(func)`: custom assertion function
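Each matcher asserts a predicate on the wrapped value (and, in LangSmith, also logs the pass/fail result as feedback). A rough sketch of the matcher semantics only, not the library's actual implementation (`ValueMatcher` and `value` here are illustrative names):

```python
class ValueMatcher:
    """Sketch of value-matcher semantics: each method asserts a predicate
    and raises AssertionError with a descriptive message on failure."""

    def __init__(self, value):
        self.value = value

    def to_equal(self, expected):
        assert self.value == expected, f"{self.value!r} != {expected!r}"

    def to_contain(self, substring):
        assert substring in self.value, f"{substring!r} not in {self.value!r}"

    def to_be_greater_than(self, other):
        assert self.value > other, f"{self.value!r} <= {other!r}"

    def to_be_between(self, low, high):
        assert low <= self.value <= high, f"{self.value!r} not in [{low}, {high}]"

    def against(self, func):
        assert func(self.value), f"custom check failed for {self.value!r}"


def value(v):
    return ValueMatcher(v)
```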
#### expect.score

Log a score without an assertion.

```python
expect.score(0.85, key="accuracy")
expect.score(1.23, key="latency_seconds")
```

**key**: Metric name. Defaults to `"score"`.

Returns a matcher for optional assertions:

```python
expect.score(0.85, key="accuracy").to_be_greater_than(0.8)
```
#### expect.embedding_distance

Compute semantic similarity using embeddings.

```python
expect.embedding_distance(
    prediction="The cat sat on the mat",
    reference="A cat was sitting on a mat",
    config={"provider": "openai", "model": "text-embedding-3-small"}
).to_be_less_than(0.3)
```

**config**: Configuration for the embedding provider.

```python
config = {
    "provider": "openai",  # or "huggingface"
    "model": "text-embedding-3-small"
}
```

Returns a matcher with the computed distance:

```python
matcher = expect.embedding_distance(pred, ref)
matcher.to_be_less_than(0.5)  # Assert similarity
```
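The result is a distance, so lower means more similar. As an illustration of the metric itself (assuming cosine distance, and using toy vectors in place of real model embeddings):

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 for vectors pointing the same way,
    1.0 for orthogonal vectors, up to 2.0 for opposite directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)
```

This is why semantically close texts like the prediction/reference pair above can pass a `to_be_less_than(0.3)` check even though they share few exact words: their embedding vectors point in nearly the same direction.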
#### expect.edit_distance

Compute the Levenshtein edit distance.

```python
expect.edit_distance(
    prediction="hello world",
    reference="hello word",
    config={"normalize": True}
).to_be_less_than(2)
```

**config** (`EditDistanceConfig | None`): Configuration options.

```python
config = {
    "normalize": True,  # Normalize by max length
}
```
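For reference, the Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A minimal sketch of the metric, with normalization assumed here to divide by the longer string's length:

```python
def levenshtein(prediction, reference, normalize=False):
    """Minimum edits (insert/delete/substitute) between two strings,
    computed with a single rolling row of the classic DP table."""
    m, n = len(prediction), len(reference)
    # dp[j] holds the distance between the current prefix of
    # `prediction` and reference[:j]
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev is the diagonal dp[i-1][j-1]
        for j in range(1, n + 1):
            cost = 0 if prediction[i - 1] == reference[j - 1] else 1
            prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                     dp[j - 1] + 1,  # insertion
                                     prev + cost)    # substitution/match
    dist = dp[n]
    return dist / max(m, n, 1) if normalize else dist
```

On the example above, "hello world" vs. "hello word" differ by one deleted character, so the raw distance is 1 and `to_be_less_than(2)` passes.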
## `unit` decorator

Alias for `@test` that creates a unit test context.

```python
from langsmith import unit

@unit
def test_helper_function():
    result = helper("input")
    assert result == "output"
```

Functionally identical to `@test`, but semantically indicates a unit test.
## Complete example

```python
import pytest
import openai
from langsmith import test, expect, wrappers

# Enable caching:
# export LANGSMITH_TEST_CACHE=tests/cassettes

oai_client = wrappers.wrap_openai(openai.Client())

@test(
    test_suite_name="qa-pipeline",
    metadata={"category": "integration", "model": "gpt-3.5-turbo"},
    cached_hosts=["api.openai.com"]
)
def test_qa_accuracy():
    """Test QA pipeline with caching and expectations."""
    question = "What is the capital of France?"
    response = oai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer questions accurately."},
            {"role": "user", "content": question}
        ]
    )
    answer = response.choices[0].message.content

    # Assertions with automatic feedback logging
    expect.value(answer).to_contain("Paris")

    # Semantic similarity check
    expect.embedding_distance(
        prediction=answer,
        reference="The capital of France is Paris."
    ).to_be_less_than(0.3)

    # Log metrics
    expect.score(
        response.usage.total_tokens,
        key="token_count"
    ).to_be_less_than(100)

@test(repetitions=3)
def test_consistency():
    """Run multiple times to check consistency."""
    result = llm_generate("test")
    expect.value(result).to_contain("expected")
```
## Environment variables

**LANGSMITH_TEST_CACHE**: Directory in which to store cached API responses. Enables caching when set.

```bash
export LANGSMITH_TEST_CACHE=tests/cassettes
```

**LANGSMITH_TEST_TRACKING**: Set to `"false"` to disable test tracking.

```bash
export LANGSMITH_TEST_TRACKING=false
```
## Best practices

- **Commit cache files**: check in cached responses for consistent CI/CD runs.
- **Use descriptive test names**: they become dataset example names in LangSmith.
- **Add metadata**: it helps organize and filter tests.
- **Use `expect` for metrics**: it logs scores alongside test results.
- **Cache expensive calls**: this speeds up tests and reduces API costs.