
Overview

The OfficeFlow agent evolved through six versions, each addressing specific production issues discovered through testing and analysis. This progression demonstrates a realistic development cycle for production AI agents.
1. v0: Basic Implementation - Initial agent with no observability
2. v1: Add Tracing - LangSmith integration for debugging
3. v2: Fix Tool Descriptions - Improved schema discovery guidance
4. v3: Stock Communication Policy - Strategic inventory messaging
5. v4: RAG Implementation - Full document retrieval instead of chunking
6. v5: Conciseness Directive - Reduced verbosity in responses

v0: The Baseline Agent

What It Does

The initial implementation includes:
  • Basic chat loop with conversation history
  • Two tools: query_database and search_knowledge_base
  • RAG using text chunking and embeddings
  • System prompt with persona and guidelines

Key Characteristics

import os
import sqlite3

from openai import AsyncOpenAI

# No tracing integration
client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Simple function without decoration
def query_database(query: str, db_path: str) -> str:
    try:
        conn = sqlite3.connect(db_path)
        cursor = conn.cursor()
        cursor.execute(query)
        results = cursor.fetchall()
        conn.close()
        return str(results)
    except Exception as e:
        return f"Error: {str(e)}"

# Basic tool definition
QUERY_DATABASE_TOOL = {
    "type": "function",
    "function": {
        "name": "query_database",
        "description": "SQL query to get information about our inventory.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "SQL query to execute"
                }
            },
            "required": ["query"]
        }
    }
}
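The chat loop glues these pieces together by executing whatever tool calls the model returns. A minimal dispatch sketch (the stub tool bodies and helper names here are illustrative, not the actual OfficeFlow implementation):

```python
import json

# Illustrative stubs standing in for the real tools defined above
def query_database(query: str) -> str:
    return f"ran: {query}"

def search_knowledge_base(query: str) -> str:
    return f"searched: {query}"

TOOLS = {
    "query_database": query_database,
    "search_knowledge_base": search_knowledge_base,
}

def dispatch(tool_call: dict) -> dict:
    """Run one model-requested tool call and wrap the result as a tool message."""
    fn = TOOLS[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": fn(**args)}
```

In the real loop, this dispatch runs after each assistant turn until the model replies without requesting any tools.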

The Problem

No observability. When the agent behaves unexpectedly, there’s no way to:
  • See what tool calls were made
  • Inspect the LLM’s reasoning
  • Debug multi-step interactions
  • Analyze patterns across conversations

v1: Adding Tracing

What Changed

Integration with LangSmith for complete observability:
import os
import sqlite3
from uuid import uuid4

from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import AsyncOpenAI

# Wrap OpenAI client for automatic tracing
client = wrap_openai(AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY")))

# Generate a unique thread ID to group this conversation's runs
thread_id = str(uuid4())

# Decorate tools with @traceable
@traceable(name="query_database", run_type="tool")
def query_database(query: str, db_path: str) -> str:
    try:
        conn = sqlite3.connect(db_path)
        cursor = conn.cursor()
        cursor.execute(query)
        results = cursor.fetchall()
        conn.close()
        return str(results)
    except Exception as e:
        return f"Error: {str(e)}"

@traceable(name="search_knowledge_base", run_type="tool")
async def search_knowledge_base(query: str, top_k: int = 2) -> str:
    # ... implementation
    pass

# Decorate main chat function
@traceable(name="Emma", metadata={"thread_id": thread_id})
async def chat(question: str) -> str:
    # ... chat logic
    pass

Benefits

Complete Visibility

Every LLM call, tool invocation, and intermediate step is recorded

Easy Debugging

Click through traces in LangSmith UI to see exactly what happened

Thread Tracking

Associate multiple interactions with the same conversation

Performance Analysis

Measure latency, token usage, and cost per interaction

v2: Enhanced Tool Descriptions

The Problem

In production, the agent would sometimes fail to query the database correctly because it didn’t know the schema. It would make assumptions or generate invalid SQL.

The Fix

Added explicit schema discovery instructions to the tool description:
QUERY_DATABASE_TOOL = {
    "type": "function",
    "function": {
        "name": "query_database",
        "description": "SQL query to get information about our inventory.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": """SQL query to execute against the inventory database.

YOU DO NOT KNOW THE SCHEMA. ALWAYS discover it first:
1. Query 'SELECT name FROM sqlite_master WHERE type="table"' to see available tables
2. Use 'PRAGMA table_info(table_name)' to inspect columns for each table
3. Only after understanding the schema, construct your search queries"""
                }
            },
            "required": ["query"]
        }
    }
}

Key Insight

Tool descriptions are critical prompt engineering real estate. The LLM reads them every time it decides whether to use a tool and how to format arguments.
By adding step-by-step instructions directly in the tool description, we ensure the agent follows the correct discovery process without requiring system prompt changes.
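The three-step discovery sequence can be exercised against a throwaway SQLite database (the `products` table here is an assumption; the real inventory schema may differ):

```python
import sqlite3

# Throwaway in-memory database standing in for the real inventory DB
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, quantity INTEGER)")

# Step 1: list the available tables
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
)]

# Step 2: inspect the columns of each table
columns = [row[1] for row in conn.execute("PRAGMA table_info(products)")]

# Step 3: only now construct the actual search query
rows = conn.execute("SELECT name FROM products WHERE quantity > 0").fetchall()
```

An agent that follows this order never has to guess column names, which is exactly the failure mode the v2 description targets.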

v3: Stock Quantity Policy

The Problem

The agent was revealing exact stock quantities to customers: “We have 47 units in stock.” This:
  • Exposed competitive information
  • Gave customers leverage to negotiate or wait
  • Didn’t create urgency for low-stock items

The Solution

Added a comprehensive stock communication policy to the system prompt:
IMPORTANT - STOCK INFORMATION POLICY:
When discussing product availability, NEVER reveal specific stock quantities or numbers to customers. Instead:
- If quantity > 20: Say the item is "in stock" or "available"
- If quantity 10-20: Say the item is "in stock, but running low" or "available, though inventory is limited" to create urgency
- If quantity 5-9: Say "only a few left in stock" or "limited availability" to encourage quick action
- If quantity 1-4: Say "very limited stock remaining" or "almost sold out"
- If quantity 0: Say "currently out of stock" or "unavailable at the moment"

This policy protects our competitive advantage and inventory management strategy while still helping customers make informed purchasing decisions.
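Because the banding is deterministic, it could also be enforced in code rather than relying on the prompt alone. A hypothetical helper (name and placement assumed, not part of the actual agent):

```python
def stock_phrase(quantity: int) -> str:
    """Map a raw stock count to customer-facing language per the v3 policy bands."""
    if quantity == 0:
        return "currently out of stock"
    if quantity <= 4:
        return "very limited stock remaining"
    if quantity <= 9:
        return "only a few left in stock"
    if quantity <= 20:
        return "in stock, but running low"
    return "in stock"
```

Feeding the tool's raw counts through a mapping like this before they reach the LLM would make the policy impossible to bypass via prompt injection.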

Business Impact

Competitors can’t gauge inventory levels by probing the agent
Low stock items use language that encourages faster purchasing decisions
Maintains helpful tone while implementing business strategy

v4: RAG Improvements

The Problem

v0-v3 used text chunking for the knowledge base:
  • Documents split into 200-character chunks with 20-character overlap
  • Agent retrieved 2 most relevant chunks
  • Often got incomplete information from split documents
  • Context boundaries could split important related information

The Fix

Switched to full document retrieval:
import json
from pathlib import Path

async def load_knowledge_base(kb_dir: str = "./knowledge_base") -> None:
    """Load knowledge base documents and embeddings for WHOLE documents (no chunking)."""
    global knowledge_base_docs, knowledge_base_embeddings

    kb_path = Path(kb_dir) / "documents"
    cache_path = Path(kb_dir) / "embeddings" / "embeddings.json"

    # Check if embeddings are stale
    if _embeddings_are_stale(kb_path, cache_path):
        print("Knowledge base documents changed, regenerating embeddings...")
        await _generate_and_cache_embeddings(kb_path, cache_path)
    else:
        # Load from cache
        with open(cache_path, 'r') as f:
            cache_data = json.load(f)
        knowledge_base_docs = [tuple(doc) for doc in cache_data["docs"]]
        knowledge_base_embeddings = cache_data["embeddings"]

@traceable(name="search_knowledge_base", run_type="tool")
async def search_knowledge_base(query: str, top_k: int = 2) -> str:
    """Search knowledge base using semantic similarity. Returns WHOLE documents, not chunks."""
    # ... generate query embedding
    # ... calculate similarities
    # ... return top k full documents
    pass
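The elided ranking step boils down to cosine similarity over whole-document embeddings. A self-contained sketch using plain Python lists (no vector library assumed):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_documents(query_emb: list[float], doc_embs: list[list[float]],
                    docs: list[str], k: int = 2) -> list[str]:
    """Return the k whole documents whose embeddings best match the query."""
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine_similarity(query_emb, doc_embs[i]),
                    reverse=True)
    return [docs[i] for i in ranked[:k]]
```

The only difference from the chunked version is what sits in `docs`: full documents rather than 200-character fragments.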

Benefits

Complete Context

The agent sees entire policy documents, not fragments

No Boundary Issues

Related information stays together

Better Answers

Can synthesize complete policies rather than partial snippets

Cache Invalidation

Automatically regenerates embeddings when docs change

Trade-offs

This approach works well for OfficeFlow because:
  • Knowledge base documents are relatively small (< 2000 tokens each)
  • The LLM context window can handle 2 full documents comfortably
  • Policy information is best understood in complete form
For larger documents or massive knowledge bases, chunking strategies may still be necessary.

v5: Conciseness Directive

The Problem

During evaluation, the agent’s responses were often too verbose:
  • Repeated information unnecessarily
  • Used filler phrases like “I’d be more than happy to help you with that”
  • Explained things that customers already understood
  • Took 3 sentences when 1 would suffice

The Solution

Added an explicit conciseness directive to the system prompt:
CONCISENESS PRIORITY:
Your responses should be brief and to the point. Avoid unnecessary filler, repetition, 
or overly elaborate explanations. Get straight to the answer. If you can say something 
in one sentence, don't use three. Customers appreciate quick, direct answers over lengthy responses.
Also updated example interactions to model concise responses:
Customer: "Do you have copy paper?"
You: "Yes, we do! We carry several types of copy paper. Are you looking for standard 8.5x11 inch letter size, or do you need a specific weight or finish? I can check what we have in stock."

Measurement

This change can be quantitatively evaluated using:
  • Token count reduction in responses
  • Character/word count metrics
  • Pairwise evaluation (v4 vs v5) with human or LLM judges
  • Customer satisfaction scores
See Evaluating Conciseness for implementation details.
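A first-pass measurement needs nothing more than size counts. A sketch comparing hypothetical v4 and v5 answers to the same question (the answer texts are invented for illustration):

```python
def verbosity_metrics(response: str) -> dict:
    """Crude size metrics for comparing response verbosity across versions."""
    return {"chars": len(response), "words": len(response.split())}

# Hypothetical answers to "Do you have copy paper?"
v4_answer = ("I'd be more than happy to help you with that! Yes, we do carry "
             "copy paper, and I can certainly check what we have in stock for you.")
v5_answer = "Yes, we carry several types of copy paper. Want me to check stock?"

# Fractional reduction in word count from v4 to v5
reduction = 1 - verbosity_metrics(v5_answer)["words"] / verbosity_metrics(v4_answer)["words"]
```

Size metrics are cheap but blunt: a shorter answer that drops needed information scores well here, which is why pairwise judging still matters.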

Version Comparison

| Feature           | v0 | v1 | v2 | v3 | v4 | v5 |
|-------------------|----|----|----|----|----|----|
| Basic chat        | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  |
| Tools (DB + KB)   | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  |
| LangSmith tracing |    | ✓  | ✓  | ✓  | ✓  | ✓  |
| Schema discovery  |    |    | ✓  | ✓  | ✓  | ✓  |
| Stock policy      |    |    |    | ✓  | ✓  | ✓  |
| Full doc RAG      |    |    |    |    | ✓  | ✓  |
| Conciseness       |    |    |    |    |    | ✓  |

Running Different Versions

cd source/python/officeflow-agent

# Run specific version
python agent_v0.py
python agent_v1.py
python agent_v2.py
python agent_v3.py
python agent_v4.py
python agent_v5.py  # Production version

Key Takeaways

Iterate Based on Evidence

Each version addresses a real issue discovered through testing or production use

Observability First

Adding tracing (v1) enables all subsequent improvements

Tool Descriptions Matter

They’re read on every tool use - make them comprehensive

Business Logic in Prompts

The stock policy (v3) shows how to encode business rules

RAG is Nuanced

Chunking vs full documents depends on your use case

Measure Everything

Conciseness improvements (v5) need evaluation to validate

Next Steps

Analyzing Agent Behavior

Learn how to use LangSmith traces to debug and improve your agents
