Agent performance depends on multiple factors: model latency, tool execution time, network I/O, and execution patterns. This guide covers proven optimization techniques for production deployments.
Cache identical requests to reduce costs and latency:
```python
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache
from langchain_openai import ChatOpenAI

# Enable global LLM caching
set_llm_cache(InMemoryCache())

model = ChatOpenAI(model="gpt-4")

# First call: hits the API
response1 = model.invoke("What is 2+2?")

# Second identical call: served from cache (near-instant)
response2 = model.invoke("What is 2+2?")
```
Cache vs Prompt Caching: This is LangChain’s application-level cache, distinct from provider-specific prompt caching (Anthropic Claude, OpenAI GPT-4, etc.).
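Conceptually, an application-level cache is just an exact-match lookup keyed on the request, checked before the model is called. A minimal pure-Python sketch of the idea (`SimpleLLMCache` and `fake_model` are illustrative stand-ins, not LangChain APIs):

```python
class SimpleLLMCache:
    """Toy application-level cache: exact-match on the prompt string."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, prompt):
        if prompt in self._store:
            self.hits += 1
            return self._store[prompt]
        self.misses += 1
        return None

    def update(self, prompt, response):
        self._store[prompt] = response


def call_model(prompt, cache, model_fn):
    """Check the cache before paying for a model call."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached
    response = model_fn(prompt)
    cache.update(prompt, response)
    return response


cache = SimpleLLMCache()
fake_model = lambda p: f"answer to {p!r}"  # Stand-in for a real LLM call

call_model("What is 2+2?", cache, fake_model)  # Miss: invokes the model
call_model("What is 2+2?", cache, fake_model)  # Hit: served from cache
```

Note the limitation this implies: a one-character difference in the prompt is a cache miss, which is why exact-match caching pays off mostly for repeated, templated requests.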
Use async patterns for concurrent tool execution and reduced latency:
```python
import aiohttp
from langchain.agents import create_agent
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool

@tool
async def fetch_weather(city: str) -> str:
    """Get weather for a city."""
    async with aiohttp.ClientSession() as session:
        async with session.get(f"https://api.weather.com/{city}") as resp:
            data = await resp.json()
            return data["description"]

@tool
async def fetch_news(topic: str) -> str:
    """Get latest news for a topic."""
    async with aiohttp.ClientSession() as session:
        async with session.get(f"https://api.news.com/{topic}") as resp:
            data = await resp.json()
            return data["headline"]

agent = create_agent(
    model="openai:gpt-4",
    tools=[fetch_weather, fetch_news],  # Both async
)

# Async invocation enables concurrent tool execution
response = await agent.ainvoke({
    "messages": [HumanMessage("Get weather in SF and latest tech news")]
})
```
When the model calls multiple tools in one turn, async tools execute concurrently, dramatically reducing total execution time.
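The win comes from overlapping I/O waits rather than serializing them. A self-contained sketch of that effect, with `fake_tool` standing in for a network-bound tool call:

```python
import asyncio
import time

async def fake_tool(name: str, delay: float) -> str:
    """Stand-in for a network-bound tool call."""
    await asyncio.sleep(delay)
    return f"{name} done"

async def run_turn():
    # Two 0.2s tool calls issued concurrently, as an agent does when
    # the model requests both tools in a single turn.
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_tool("weather", 0.2),
        fake_tool("news", 0.2),
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(run_turn())
# Wall time is close to the slowest call (~0.2s), not the sum (~0.4s)
```

The same reasoning explains why defining tools with `async def` matters: a synchronous tool blocks the event loop, forcing the calls back into sequence.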
Process multiple inputs efficiently with batching:
```python
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4")

# Sequential (slow)
results = [model.invoke(msg) for msg in messages]  # N API calls, one after another

# Batched (fast)
results = model.batch(messages)  # API calls run in parallel under the hood

# Async batching (fastest for large batches)
results = await model.abatch(messages)
```
The same pattern applies to agents:

```python
from langchain.agents import create_agent
from langchain_core.messages import HumanMessage

agent = create_agent(model="openai:gpt-4", tools=[search_tool])

inputs = [
    {"messages": [HumanMessage("Weather in NYC?")]},
    {"messages": [HumanMessage("Capital of France?")]},
    {"messages": [HumanMessage("2+2=")]},
]

# Process all inputs in parallel
results = await agent.abatch(inputs)
```
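If your downstream APIs are rate-limited, unbounded fan-out can backfire; LangChain lets you cap parallelism via `max_concurrency` in the run config. The underlying mechanism is essentially a bounded gather, sketched here in plain asyncio (`bounded_batch` and `fake_agent` are illustrative helpers, not library APIs):

```python
import asyncio

async def bounded_batch(call, inputs, max_concurrency=2):
    """Run `call` over all inputs concurrently, with at most
    `max_concurrency` calls in flight at any moment."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with sem:
            return await call(item)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run_one(i) for i in inputs))

async def fake_agent(query: str) -> str:
    await asyncio.sleep(0.01)  # Stand-in for model + tool latency
    return query.upper()

results = asyncio.run(
    bounded_batch(fake_agent, ["a", "b", "c"], max_concurrency=2)
)
# → ["A", "B", "C"]
```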
Match the model to the task: route simple queries to a fast, cheap model and reserve the larger model for complex work.

```python
from langchain_openai import ChatOpenAI
from langchain.agents.middleware import wrap_model_call

# Fast responses (lower quality)
fast_model = ChatOpenAI(
    model="gpt-4o-mini",  # Smaller, faster model
    temperature=0.3,      # More deterministic output
    max_tokens=500,       # Limit output length
)

# High quality (slower)
quality_model = ChatOpenAI(
    model="gpt-4o",
    temperature=0.7,
    max_tokens=2000,
)

# Use the fast model for simple tasks, the quality model for complex ones
@wrap_model_call
def adaptive_model_selection(request, handler):
    """Use the fast model for simple queries."""
    user_message = request.messages[-1].content
    if len(user_message) < 50 and "?" in user_message:
        # Simple question: use fast model
        return handler(request.override(model=fast_model))
    else:
        # Complex task: use quality model
        return handler(request)
```