Data anonymization

The LangSmith SDK provides utilities to anonymize sensitive data in traces. Use the create_anonymizer function (createAnonymizer in the TypeScript SDK) to automatically redact PII, secrets, and other sensitive information before it’s sent to LangSmith.

How it works

The anonymizer:

Extracts all string values from your data (inputs, outputs, metadata)
Applies rules or custom functions to detect and replace sensitive patterns
Reconstructs the data structure with redacted values

All data structures (nested objects, arrays) are preserved - only string values are modified.
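
The extract, apply, and reconstruct steps can be sketched with the standard library alone. This is only an illustration of the idea, not the SDK's implementation; the helper names extract_strings, set_at_path, and redact are hypothetical.

```python
import copy
import re

EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.I)


def extract_strings(data, path=()):
    """Yield (path, value) for every string found in nested dicts and lists."""
    if isinstance(data, str):
        yield list(path), data
    elif isinstance(data, dict):
        for key, value in data.items():
            yield from extract_strings(value, path + (key,))
    elif isinstance(data, list):
        for index, value in enumerate(data):
            yield from extract_strings(value, path + (index,))
    # Numbers, booleans, None, and other types are ignored.


def set_at_path(data, path, value):
    """Write value at the location described by path (keys and list indexes)."""
    target = data
    for part in path[:-1]:
        target = target[part]
    target[path[-1]] = value


def redact(data):
    """Return a redacted copy of data; the input itself is left untouched."""
    result = copy.deepcopy(data)
    for path, value in extract_strings(data):
        new_value = EMAIL.sub("[email]", value)
        if new_value == value:
            continue
        if not path:
            # The whole input was a single string.
            result = new_value
        else:
            set_at_path(result, path, new_value)
    return result


sample = {"users": [{"contact": "alice@example.com", "age": 30}], "active": True}
redacted = redact(sample)
print(redacted)
# {'users': [{'contact': '[email]', 'age': 30}], 'active': True}
```

Working on a deep copy keeps the original object usable by the application while only the copy is sent on.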

Basic usage with regex patterns

Define patterns to match and redact:

from langsmith.anonymizer import create_anonymizer
import re

# Define rules for redaction
anonymizer = create_anonymizer(
    [
        {"pattern": re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.I), "replace": "[email]"},
        {"pattern": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "replace": "[ssn]"},
        {"pattern": re.compile(r"sk-[a-zA-Z0-9]{32,}"), "replace": "[api-key]"},
    ]
)

# Apply to data
data = {
    "message": "Contact user@example.com for API key sk-abcdefghijklmnopqrstuvwxyz123456",
    "ssn": "123-45-6789"
}

redacted = anonymizer(data)
print(redacted)
# Output:
# {
#   "message": "Contact [email] for API key [api-key]",
#   "ssn": "[ssn]"
# }

Using with traceable

Integrate anonymization into your tracing workflow:

from langsmith import traceable
from langsmith.anonymizer import create_anonymizer
import re

# Create anonymizer
anonymizer = create_anonymizer(
    [
        {"pattern": re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.I), "replace": "[email]"},
        {"pattern": re.compile(r"\d{16}"), "replace": "[credit-card]"},
    ]
)

@traceable(
    process_inputs=anonymizer,
    process_outputs=anonymizer
)
def process_user_data(user_input: dict) -> dict:
    # Process data - sensitive info is redacted in traces
    response = {
        "message": f"Processed request from {user_input['email']}",
        "payment": user_input.get("card")
    }
    return response

# Sensitive data in traces will be redacted
result = process_user_data({
    "email": "user@example.com",
    "card": "1234567890123456"
})

Custom anonymizer function

Use a custom function for complex logic:

from langsmith.anonymizer import create_anonymizer

def custom_redactor(value: str, path: list) -> str:
    """Custom function to redact based on value and path."""
    # Redact based on field path
    if "password" in path or "secret" in path:
        return "[redacted]"
    
    # Redact phone numbers
    if len(value) == 10 and value.isdigit():
        return "[phone]"
    
    # Redact credit cards (basic check)
    if len(value) == 16 and value.isdigit():
        return "[credit-card]"
    
    return value

anonymizer = create_anonymizer(custom_redactor)

data = {
    "user": {
        "name": "Alice",
        "password": "super-secret-123",
        "phone": "5551234567"
    },
    "payment": "1234567890123456"
}

redacted = anonymizer(data)
print(redacted)
# Output:
# {
#   "user": {
#     "name": "Alice",
#     "password": "[redacted]",
#     "phone": "[phone]"
#   },
#   "payment": "[credit-card]"
# }

Common patterns

Here are ready-to-use patterns for common sensitive data:

import re
from langsmith.anonymizer import create_anonymizer

# Common PII patterns
common_patterns = [
    # Email addresses
    {"pattern": re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.I), "replace": "[email]"},
    
    # Phone numbers (US format)
    {"pattern": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"), "replace": "[phone]"},
    
    # Social Security Numbers
    {"pattern": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "replace": "[ssn]"},
    
    # Credit card numbers (basic)
    {"pattern": re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b"), "replace": "[credit-card]"},
    
    # API keys (common formats)
    {"pattern": re.compile(r"sk-[a-zA-Z0-9]{32,}"), "replace": "[api-key]"},
    {"pattern": re.compile(r"[a-zA-Z0-9_-]{32,}"), "replace": "[token]"},
    
    # IP addresses
    {"pattern": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "replace": "[ip]"},
    
    # URLs with credentials (user and password may not contain whitespace)
    {"pattern": re.compile(r"https?://[^\s:/@]+:[^\s@]+@"), "replace": "https://[credentials]@"},
]

anonymizer = create_anonymizer(common_patterns)
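
Before wiring rules like these into an anonymizer, it can help to check them against sample text with plain re. The sketch below assumes rules are applied in list order, each to the output of the previous one; apply_rules is a hypothetical helper, not part of the SDK.

```python
import re

rules = [
    {"pattern": re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.I), "replace": "[email]"},
    {"pattern": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"), "replace": "[phone]"},
    {"pattern": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "replace": "[ssn]"},
    {"pattern": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "replace": "[ip]"},
]


def apply_rules(text, rules):
    """Apply each rule in order, feeding the result into the next rule."""
    for rule in rules:
        text = rule["pattern"].sub(rule["replace"], text)
    return text


samples = [
    "Mail bob@example.org",
    "Call 555-123-4567 today",
    "SSN 123-45-6789 on file",
    "Client 192.168.0.1 connected",
]

results = {text: apply_rules(text, rules) for text in samples}
for original, masked in results.items():
    print(f"{original!r} -> {masked!r}")
```

Running each sample through the full rule list, rather than one rule at a time, also surfaces cases where an earlier rule consumes text that a later rule was meant to catch.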

Controlling traversal depth

Limit how deep the anonymizer traverses nested objects:

from langsmith.anonymizer import create_anonymizer
import re

anonymizer = create_anonymizer(
    [{"pattern": re.compile(r"secret", re.I), "replace": "[redacted]"}],
    max_depth=5  # Only traverse 5 levels deep
)

data = {
    "level1": {
        "level2": {
            "level3": {
                "level4": {
                    "level5": {
                        "level6": "secret data"  # May not be redacted if max_depth=5
                    }
                }
            }
        }
    }
}
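
To see why a depth limit can leave deeply nested strings untouched, here is a standard-library sketch of a depth-limited walk. How the SDK counts levels is an assumption here, and extract_strings is a hypothetical helper; the point is only that strings below the cutoff are never visited and so never reach the redaction rules.

```python
def extract_strings(data, max_depth, depth=0, path=()):
    """Collect (path, value) pairs without opening containers at depth >= max_depth."""
    if isinstance(data, str):
        return [(list(path), data)]
    if isinstance(data, dict):
        items = list(data.items())
    elif isinstance(data, list):
        items = list(enumerate(data))
    else:
        return []  # numbers, booleans, None
    if depth >= max_depth:
        return []  # too deep: this container is skipped entirely
    found = []
    for key, value in items:
        found.extend(extract_strings(value, max_depth, depth + 1, path + (key,)))
    return found


nested = {"a": {"b": {"c": "secret"}}, "top": "secret"}

shallow = extract_strings(nested, max_depth=2)
deep = extract_strings(nested, max_depth=3)

print(shallow)  # [(['top'], 'secret')]
print(deep)     # [(['a', 'b', 'c'], 'secret'), (['top'], 'secret')]
```

A limit that is too low trades safety for speed, so it is worth checking it against the deepest payloads you expect to trace.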

Advanced: Custom node processor

For maximum control, implement a custom StringNodeProcessor:

from langsmith.anonymizer import create_anonymizer, StringNodeProcessor, StringNode
from typing import Any

class CustomProcessor(StringNodeProcessor):
    def mask_nodes(self, nodes: list[StringNode]) -> list[StringNode]:
        """Process all string nodes at once."""
        result = []
        for node in nodes:
            value = node["value"]
            path = node["path"]
            
            # Custom logic
            if "email" in str(path).lower():
                result.append({"value": "[email]", "path": path})
            elif len(value) > 100:
                result.append({"value": value[:50] + "...[truncated]", "path": path})
        
        return result

anonymizer = create_anonymizer(CustomProcessor())

Best practices

Start with common patterns

Use the ready-made patterns for emails, phones, SSNs, and API keys as a baseline.

Test your anonymizer

Validate that sensitive data is actually being redacted:

test_data = {"email": "user@example.com", "secret": "sk-abcdefghijklmnopqrstuvwxyz123456"}
redacted = anonymizer(test_data)
assert "user@example.com" not in str(redacted)
assert "sk-abcdefghijklmnopqrstuvwxyz123456" not in str(redacted)
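
A small helper can make such checks reusable across many sample values. The sketch below is an assumption-laden illustration: find_leaks is a hypothetical helper, and stand_in_anonymizer only mimics a single API-key rule so the example runs without the SDK. It also shows how a pattern that requires 32 or more characters lets a shorter key through.

```python
import re

API_KEY = re.compile(r"sk-[a-zA-Z0-9]{32,}")


def stand_in_anonymizer(data):
    """Stand-in for a real anonymizer: masks long sk- keys in a flat dict."""
    return {key: API_KEY.sub("[api-key]", value) for key, value in data.items()}


def find_leaks(redacted, secrets):
    """Return the secrets that still appear in the string form of redacted."""
    text = str(redacted)
    return [secret for secret in secrets if secret in text]


long_key = "sk-" + "a" * 32
short_key = "sk-abc123"

redacted = stand_in_anonymizer({"long": long_key, "short": short_key})
leaks = find_leaks(redacted, [long_key, short_key])
print(leaks)  # ['sk-abc123']
```

Returning the list of leaked values, instead of asserting inside the helper, makes failures easier to read in a test report.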

Be cautious with overly broad patterns

Avoid patterns that might redact too much:

# Bad: Matches almost any string
{"pattern": re.compile(r"[a-zA-Z0-9]+"), "replace": "[redacted]"}

# Good: Specific pattern
{"pattern": re.compile(r"sk-[a-zA-Z0-9]{32,}"), "replace": "[api-key]"}
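
The difference is easy to see by running both patterns over the same text with plain re. This is a standalone sketch; the sample text and key are made up for illustration.

```python
import re

text = "Order 42 shipped; key sk-" + "a" * 32

broad = re.compile(r"[a-zA-Z0-9]+")
specific = re.compile(r"sk-[a-zA-Z0-9]{32,}")

broad_result = broad.sub("[redacted]", text)
specific_result = specific.sub("[api-key]", text)

print(broad_result)
# [redacted] [redacted] [redacted]; [redacted] [redacted]-[redacted]
print(specific_result)
# Order 42 shipped; key [api-key]
```

The broad pattern removes the key but also destroys every other word, which makes the resulting trace useless for debugging.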

Use different labels for different types

Make it clear what was redacted:

"[email]", "[ssn]", "[api-key]", "[credit-card]"
# Not just "[redacted]" for everything

Consider performance

For high-volume tracing, keep patterns efficient and limit max_depth.

Important notes

Anonymization happens client-side before data is sent to LangSmith. Once data is sent without anonymization, it cannot be retroactively redacted.

The anonymizer only processes string values - other types (numbers, booleans) are unchanged
Nested objects and arrays are preserved - structure is maintained
In TypeScript, regex patterns should use the g (global) flag to replace all occurrences; Python's re module replaces every match by default and has no g flag
Python uses re.compile() while TypeScript uses regex literals
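
On the Python side, a compiled pattern's sub method replaces every match unless a count is given, so no extra flag is needed. A quick standard-library check, independent of the SDK:

```python
import re

pattern = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
text = "SSNs on file: 123-45-6789 and 987-65-4321"

all_replaced = pattern.sub("[ssn]", text)
first_only = pattern.sub("[ssn]", text, count=1)

print(all_replaced)  # SSNs on file: [ssn] and [ssn]
print(first_only)    # SSNs on file: [ssn] and 987-65-4321
```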
