LangSmith provides a pytest integration for tracing test cases, caching LLM responses, and making assertions with the `expect` API.
## `test` decorator

The `@test` decorator (or `@pytest.mark.langsmith`) traces pytest tests to LangSmith.

```python
import pytest
from langsmith import test

@test  # or @pytest.mark.langsmith
def test_my_function():
    result = my_function("input")
    assert result == "expected"
```
### Parameters

**id**: Unique identifier for the test case. Auto-generated from the test name if not provided.

**output_keys**: Keys from the test function's inputs to extract as expected outputs.

```python
@test(output_keys=["expected"])
def test_function(input_data, expected):
    result = process(input_data)
    assert result == expected
    # 'expected' is stored as ground truth in LangSmith
```

**client**: Custom LangSmith client to use for this test.

**test_suite_name**: Name of the test suite. Defaults to the package name or the corresponding environment variable.

```python
@test(test_suite_name="integration-tests")
def test_integration():
    pass
```

**metadata**: Additional metadata to attach to the test case.

```python
@test(metadata={"category": "unit", "priority": "high"})
def test_critical_path():
    pass
```

**repetitions**: Number of times to repeat the test. Useful for measuring variance.

```python
@test(repetitions=5)
def test_with_variance():
    # Test runs 5 times to measure consistency
    result = llm_call()
    assert "expected" in result
```

**cached_hosts**: List of hosts whose requests are cached. Only applies when LANGSMITH_TEST_CACHE is set.

```python
@test(cached_hosts=["api.openai.com"])
def test_openai_only_cached():
    # Only OpenAI calls are cached; others are live
    pass
```
## Request caching

Cache LLM API calls to speed up tests and reduce costs:

```bash
# Enable caching
export LANGSMITH_TEST_CACHE=tests/cassettes
```
```python
import pytest
import openai
from langsmith import wrappers

# Wrap the OpenAI client
oai_client = wrappers.wrap_openai(openai.Client())

@pytest.mark.langsmith
def test_openai_completion():
    response = oai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Say hello"}]
    )
    assert "hello" in response.choices[0].message.content.lower()
    # Response is cached for subsequent runs
```
### Installation

Caching requires the `vcr` extra:

```bash
pip install "langsmith[vcr]"
```
### How it works

- **First run**: real API calls are made and responses are cached to disk.
- **Subsequent runs**: cached responses are returned instantly.
- Cache files are stored in the directory specified by LANGSMITH_TEST_CACHE.
- Commit cache files to version control for consistent CI/CD runs.
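This record/replay pattern is the same idea VCR-style cassettes use. A minimal sketch of the mechanism, not LangSmith's actual implementation (`cached_call` and `call_api` are illustrative names):

```python
import hashlib
import json
import os


def cached_call(cache_dir, request, call_api):
    """VCR-style record/replay: return a cached response if one exists
    for this request, otherwise call the API and record the response."""
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(path):
        # Subsequent run: replay the recorded response from disk
        with open(path) as f:
            return json.load(f)
    # First run: make the real API call and record it
    response = call_api(request)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(response, f)
    return response
```

Keying the cache on a hash of the full request is what makes replay deterministic: any change to the model, prompt, or parameters produces a new key and therefore a fresh recording.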
### Selective caching

Cache only specific hosts:

```python
@test(cached_hosts=["api.openai.com", "api.anthropic.com"])
def test_multi_provider():
    # Only OpenAI and Anthropic calls are cached;
    # other API calls are made live
    openai_response = openai_client.chat.completions.create(...)
    anthropic_response = anthropic_client.messages.create(...)
```
## `expect` API

Make assertions and log scores within test cases.

```python
import pytest
from langsmith import expect

@pytest.mark.langsmith
def test_with_expectations():
    response = my_llm_function("input")

    # Assert with automatic feedback logging
    expect.value(response).to_contain("expected text")

    # Score without assertion
    expect.score(0.95, key="quality")

    # Embedding similarity
    expect.embedding_distance(
        prediction=response,
        reference="expected output"
    ).to_be_less_than(0.5)

    # Edit distance
    expect.edit_distance(
        prediction=response,
        reference="expected"
    ).to_be_less_than(5)
```
### Methods

#### expect.value

Make assertions on any value.

```python
expect.value(response).to_contain("hello")
expect.value(count).to_be_greater_than(5)
expect.value(result).to_equal(expected)
expect.value(result).against(lambda x: x > 0)
```

- `to_equal(value)`: exact equality
- `to_contain(substring)`: string contains substring
- `to_be_greater_than(value)`: numeric comparison
- `to_be_less_than(value)`: numeric comparison
- `to_be_between(min, max)`: range check
- `to_be_approximately(value, precision)`: approximate equality
- `against(func)`: custom assertion function
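Each matcher asserts a predicate on the wrapped value (and, in LangSmith, also logs the pass/fail result as feedback). A rough sketch of the matcher semantics only, not the library's actual implementation (`ValueMatcher` and `value` here are illustrative names):

```python
class ValueMatcher:
    """Sketch of value-matcher semantics: each method asserts a predicate
    and raises AssertionError with a descriptive message on failure."""

    def __init__(self, value):
        self.value = value

    def to_equal(self, expected):
        assert self.value == expected, f"{self.value!r} != {expected!r}"

    def to_contain(self, substring):
        assert substring in self.value, f"{substring!r} not in {self.value!r}"

    def to_be_greater_than(self, other):
        assert self.value > other, f"{self.value!r} <= {other!r}"

    def to_be_between(self, low, high):
        assert low <= self.value <= high, f"{self.value!r} not in [{low}, {high}]"

    def against(self, func):
        assert func(self.value), f"custom check failed for {self.value!r}"


def value(v):
    return ValueMatcher(v)
```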
#### expect.score

Log a score without an assertion.

```python
expect.score(0.85, key="accuracy")
expect.score(1.23, key="latency_seconds")
```

**key**: Metric name. Defaults to `"score"`.

Returns a matcher for optional assertions:

```python
expect.score(0.85, key="accuracy").to_be_greater_than(0.8)
```
#### expect.embedding_distance

Compute semantic similarity using embeddings.

```python
expect.embedding_distance(
    prediction="The cat sat on the mat",
    reference="A cat was sitting on a mat",
    config={"provider": "openai", "model": "text-embedding-3-small"}
).to_be_less_than(0.3)
```

**config**: Configuration for the embedding provider.

```python
config = {
    "provider": "openai",  # or "huggingface"
    "model": "text-embedding-3-small"
}
```

Returns a matcher with the computed distance:

```python
matcher = expect.embedding_distance(pred, ref)
matcher.to_be_less_than(0.5)  # Assert similarity
```
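The result is a distance, so lower means more similar. As an illustration of the metric itself (assuming cosine distance, and using toy vectors in place of real model embeddings):

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0.0 for vectors pointing the same way,
    1.0 for orthogonal vectors, up to 2.0 for opposite directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)
```

This is why semantically close texts like the prediction/reference pair above can pass a `to_be_less_than(0.3)` check even though they share few exact words: their embedding vectors point in nearly the same direction.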
#### expect.edit_distance

Compute the Levenshtein edit distance.

```python
expect.edit_distance(
    prediction="hello world",
    reference="hello word",
    config={"normalize": True}
).to_be_less_than(2)
```

**config** (`EditDistanceConfig | None`): Configuration options.

```python
config = {
    "normalize": True,  # Normalize by max length
}
```
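For reference, the Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A minimal sketch of the metric, with normalization assumed here to divide by the longer string's length:

```python
def levenshtein(prediction, reference, normalize=False):
    """Minimum edits (insert/delete/substitute) between two strings,
    computed with a single rolling row of the classic DP table."""
    m, n = len(prediction), len(reference)
    # dp[j] holds the distance between the current prefix of
    # `prediction` and reference[:j]
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev is the diagonal dp[i-1][j-1]
        for j in range(1, n + 1):
            cost = 0 if prediction[i - 1] == reference[j - 1] else 1
            prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                     dp[j - 1] + 1,  # insertion
                                     prev + cost)    # substitution/match
    dist = dp[n]
    return dist / max(m, n, 1) if normalize else dist
```

On the example above, "hello world" vs. "hello word" differ by one deleted character, so the raw distance is 1 and `to_be_less_than(2)` passes.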
## `unit` decorator

Alias for `@test` that creates a unit test context.

```python
from langsmith import unit

@unit
def test_helper_function():
    result = helper("input")
    assert result == "output"
```

Functionally identical to `@test`, but semantically indicates a unit test.
## Complete example

```python
import pytest
import openai
from langsmith import test, expect, wrappers

# Enable caching:
# export LANGSMITH_TEST_CACHE=tests/cassettes

oai_client = wrappers.wrap_openai(openai.Client())

@test(
    test_suite_name="qa-pipeline",
    metadata={"category": "integration", "model": "gpt-3.5-turbo"},
    cached_hosts=["api.openai.com"]
)
def test_qa_accuracy():
    """Test QA pipeline with caching and expectations."""
    question = "What is the capital of France?"
    response = oai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer questions accurately."},
            {"role": "user", "content": question}
        ]
    )
    answer = response.choices[0].message.content

    # Assertions with automatic feedback logging
    expect.value(answer).to_contain("Paris")

    # Semantic similarity check
    expect.embedding_distance(
        prediction=answer,
        reference="The capital of France is Paris."
    ).to_be_less_than(0.3)

    # Log metrics
    expect.score(
        response.usage.total_tokens,
        key="token_count"
    ).to_be_less_than(100)

@test(repetitions=3)
def test_consistency():
    """Run multiple times to check consistency."""
    result = llm_generate("test")
    expect.value(result).to_contain("expected")
```
## Environment variables

**LANGSMITH_TEST_CACHE**: Directory in which to store cached API responses. Enables caching when set.

```bash
export LANGSMITH_TEST_CACHE=tests/cassettes
```

**LANGSMITH_TEST_TRACKING**: Set to `"false"` to disable test tracking.

```bash
export LANGSMITH_TEST_TRACKING=false
```
## Best practices

- **Commit cache files**: check in cached responses for consistent CI/CD runs.
- **Use descriptive test names**: they become dataset example names in LangSmith.
- **Add metadata**: it helps organize and filter tests.
- **Use `expect` for metrics**: it logs scores alongside test results.
- **Cache expensive calls**: this speeds up tests and reduces API costs.