
Overview

LiteLLM provides comprehensive support for Cohere’s models including Command R+, chat completions, embeddings, and reranking capabilities.

Quick Start

1. Install LiteLLM

pip install litellm

2. Set API Key

export COHERE_API_KEY="your-api-key"

3. Make Your First Call

from litellm import completion

response = completion(
    model="cohere/command-r-plus",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Supported Models

Command R+ is Cohere's most capable model, suited to complex tasks.
from litellm import completion

response = completion(
    model="cohere/command-r-plus",
    messages=[{"role": "user", "content": "Analyze this data..."}],
    max_tokens=1000,
    temperature=0.7
)

Authentication

export COHERE_API_KEY="your-api-key"
from litellm import completion

response = completion(
    model="cohere/command-r-plus",
    messages=[{"role": "user", "content": "Hello!"}]
)

Function Calling

Cohere supports function calling with automatic tool translation.
from litellm import completion

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather in a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = completion(
    model="cohere/command-r-plus",
    messages=[{"role": "user", "content": "What's the weather in NYC?"}],
    tools=tools
)

# Check for tool calls
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Tool: {tool_call.function.name}")
    print(f"Args: {tool_call.function.arguments}")
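
After running the tool yourself, append both the assistant turn and a role "tool" message to the history before calling completion() again, so the model can ground its final answer. A minimal sketch with a hand-built tool call in the OpenAI shape LiteLLM normalizes responses into (no API key needed):

```python
import json

# Suppose the model returned this tool call (OpenAI-format dict for illustration;
# in practice you read it off response.choices[0].message.tool_calls):
tool_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"location": "New York, NY"}'},
}

# Parse the arguments and run your real tool; this result is a stand-in.
args = json.loads(tool_call["function"]["arguments"])
weather = {"location": args["location"], "temp_f": 68}

# Conversation history for the follow-up completion() call:
messages = [
    {"role": "user", "content": "What's the weather in NYC?"},
    {"role": "assistant", "content": None, "tool_calls": [tool_call]},
    {"role": "tool", "tool_call_id": tool_call["id"], "content": json.dumps(weather)},
]
```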

Streaming

from litellm import completion

response = completion(
    model="cohere/command-r-plus",
    messages=[{"role": "user", "content": "Write a story..."}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
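
To keep the complete text alongside the live stream, accumulate the deltas as they arrive. A sketch using stand-in delta values, since a real stream needs an API key:

```python
# Stand-ins for the chunk.choices[0].delta.content values a streamed
# response yields; empty/None deltas are normal and should be skipped.
deltas = ["Once ", "upon ", None, "a time."]

parts = []
for delta in deltas:
    if delta:
        parts.append(delta)

full_text = "".join(parts)
print(full_text)  # Once upon a time.
```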

Embeddings

Cohere's v3.0 embedding models are the latest generation, with improved retrieval performance.
from litellm import embedding

response = embedding(
    model="cohere/embed-english-v3.0",
    input=["Text to embed", "Another text"]
)

embeddings = [data.embedding for data in response.data]
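
A common next step is comparing embeddings with cosine similarity, e.g. for semantic search. A self-contained sketch using toy vectors in place of real embedding output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy vectors standing in for response.data[i].embedding values.
v1 = [0.1, 0.3, 0.5]
v2 = [0.1, 0.3, 0.5]
v3 = [0.9, -0.2, 0.0]

print(cosine_similarity(v1, v2))  # identical vectors score ~1.0
print(cosine_similarity(v1, v3))  # dissimilar vectors score lower
```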

Reranking

Cohere’s rerank models improve search results.
from litellm import rerank

response = rerank(
    model="cohere/rerank-english-v3.0",
    query="What is the capital of France?",
    documents=[
        "Paris is the capital of France.",
        "London is the capital of England.",
        "Berlin is the capital of Germany."
    ],
    top_n=2
)

# Get ranked results
for result in response.results:
    print(f"Score: {result.relevance_score}")
    print(f"Document: {result.document}")
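
Each result also references the input list by index, so you can map scores back to your own documents. A sketch with hypothetical scores (a live rerank call needs an API key):

```python
documents = [
    "Paris is the capital of France.",
    "London is the capital of England.",
    "Berlin is the capital of Germany.",
]

# Hypothetical rerank results: each entry points at an input document by
# index and carries its relevance score, ordered best-first.
results = [
    {"index": 0, "relevance_score": 0.98},
    {"index": 2, "relevance_score": 0.07},
]

ranked = [(documents[r["index"]], r["relevance_score"]) for r in results]
print(ranked[0][0])  # the most relevant document
```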

Citations

Cohere automatically provides citations for grounded responses.
from litellm import completion

response = completion(
    model="cohere/command-r-plus",
    messages=[{"role": "user", "content": "Tell me about LiteLLM"}],
    documents=[
        {"text": "LiteLLM is a unified interface for LLMs."},
        {"text": "It supports 100+ LLM providers."}
    ]
)

# Access citations
if hasattr(response, 'citations'):
    for citation in response.citations:
        print(f"Cited document: {citation.document_ids}")
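
To show sources to end users, resolve the cited document_ids back to the documents you passed in. A sketch with a hypothetical citation payload (the field names here are assumptions; check the shape of your actual response object):

```python
# Documents keyed by id, mirroring what was passed to completion().
documents = [
    {"id": "doc_0", "text": "LiteLLM is a unified interface for LLMs."},
    {"id": "doc_1", "text": "It supports 100+ LLM providers."},
]

# Hypothetical citation entries: character spans in the answer plus the
# ids of the documents that ground that span.
citations = [{"start": 0, "end": 24, "document_ids": ["doc_0"]}]

by_id = {d["id"]: d["text"] for d in documents}
sources = [by_id[i] for c in citations for i in c["document_ids"]]
print(sources[0])
```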

Configuration

from litellm import completion

response = completion(
    model="cohere/command-r-plus",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.8,
    max_tokens=500,
    top_p=0.9,
    frequency_penalty=0.5,
    presence_penalty=0.5
)

Supported Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| temperature | float | Randomness (0-1) |
| max_tokens | int | Max output tokens |
| max_completion_tokens | int | Alternative to max_tokens |
| top_p | float | Nucleus sampling |
| frequency_penalty | float | Reduce repetition |
| presence_penalty | float | Encourage diversity |
| stop | list | Stop sequences |
| n | int | Number of completions |
| seed | int | Reproducibility |
| preamble | str | System message |
| k | int | Top-k sampling |
| documents | list | Documents for grounding |

Error Handling

from litellm import completion
from litellm.exceptions import APIError, RateLimitError

try:
    response = completion(
        model="cohere/command-r-plus",
        messages=[{"role": "user", "content": "Hello!"}]
    )
except RateLimitError as e:
    print(f"Rate limit exceeded: {e}")
except APIError as e:
    print(f"API error: {e.status_code} - {e.message}")
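
For transient failures such as rate limits, retrying with exponential backoff is a common pattern. A self-contained sketch with a stand-in for the completion() call (a real call needs an API key):

```python
import time

def with_retries(fn, max_attempts=3, base_delay=0.01):
    """Call fn, retrying with exponential backoff; re-raise after max_attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Stand-in for completion() that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky))  # succeeds on the third attempt
```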

LiteLLM Proxy

Use Cohere through the LiteLLM proxy server. First, add the model to your proxy configuration:
model_list:
  - model_name: command-r-plus
    litellm_params:
      model: cohere/command-r-plus
      api_key: os.environ/COHERE_API_KEY
Then call the proxy with any OpenAI-compatible client:

import openai

client = openai.OpenAI(
    api_key="sk-1234",  # LiteLLM proxy key
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="command-r-plus",
    messages=[{"role": "user", "content": "Hello!"}]
)

Best Practices

  • Use max_completion_tokens instead of the deprecated max_tokens
  • Monitor token usage via response.usage; Cohere reports billed units, which reflect what you are actually charged
  • Use Command R for balanced performance and cost
  • Use Command R+ for complex reasoning
  • Enable streaming for faster perceived response times
  • LiteLLM automatically converts OpenAI-format requests to Cohere's format
  • Pass force_single_step=True when a tool call should resolve in a single step rather than multi-step tool use
  • Return tool results to the model as role "tool" messages in the conversation history
