Integrate Portkey with LlamaIndex to build robust RAG (Retrieval-Augmented Generation) applications with access to 250+ LLMs and production-grade reliability.

Overview

Portkey enhances LlamaIndex applications with:
  • Multi-Provider Support: Route to 250+ LLMs seamlessly
  • Reliability: Automatic fallbacks and retries
  • Performance: Smart caching for embeddings and completions
  • Observability: Full logging and tracing for RAG pipelines
  • Cost Optimization: Track and reduce token usage

Installation

pip install portkey-ai llama-index

Quick Start

LlamaIndex works with Portkey through the OpenAI-compatible interface:

Step 1: Import and Configure

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from portkey_ai import createHeaders, PORTKEY_GATEWAY_URL

# Create Portkey headers
portkey_headers = createHeaders(
    api_key="your-portkey-api-key",
    provider="openai"
)

Step 2: Initialize LLM

llm = OpenAI(
    model="gpt-4",
    api_key="your-openai-api-key",
    api_base=PORTKEY_GATEWAY_URL,
    default_headers=portkey_headers
)

Step 3: Use in Your Application

response = llm.complete("Explain quantum computing")
print(response)

Complete RAG Setup

Build an end-to-end RAG pipeline that routes both completions and embeddings through Portkey:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from portkey_ai import createHeaders, PORTKEY_GATEWAY_URL

# Configure Portkey
portkey_headers = createHeaders(
    api_key="your-portkey-api-key",
    provider="openai",
    metadata={"application": "rag-pipeline"}
)

# Configure LLM
llm = OpenAI(
    model="gpt-4",
    api_key="your-openai-api-key",
    api_base=PORTKEY_GATEWAY_URL,
    default_headers=portkey_headers
)

# Configure embeddings
embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key="your-openai-api-key",
    api_base=PORTKEY_GATEWAY_URL,
    default_headers=portkey_headers
)

# Set global settings
Settings.llm = llm
Settings.embed_model = embed_model

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Create index
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What are the main topics in the documents?")
print(response)

Using Different Providers

Because Portkey translates requests at the gateway, you keep the same OpenAI-compatible client and change only the provider header and model:
portkey_headers = createHeaders(
    api_key="your-portkey-api-key",
    provider="anthropic"
)

llm = OpenAI(
    model="claude-3-opus-20240229",
    api_key="your-anthropic-api-key",
    api_base=PORTKEY_GATEWAY_URL,
    default_headers=portkey_headers
)

Advanced Routing

Fallback Configuration

Automatically fall back to backup providers when a request fails:
config = {
    "strategy": {"mode": "fallback"},
    "targets": [
        {"virtual_key": "openai-virtual-key"},
        {"virtual_key": "anthropic-virtual-key"},
        {"virtual_key": "together-virtual-key"}
    ]
}

portkey_headers = createHeaders(
    api_key="your-portkey-api-key",
    config=config
)

llm = OpenAI(
    model="gpt-4",
    api_key="X",  # placeholder; the virtual keys in the config supply the real credentials
    api_base=PORTKEY_GATEWAY_URL,
    default_headers=portkey_headers
)
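
If you assemble fallback configs in several places, a small helper keeps them consistent. `make_fallback_config` is a hypothetical convenience function (not part of the Portkey SDK) that builds the same config shape as above from an ordered list of virtual keys:

```python
def make_fallback_config(virtual_keys):
    """Build a Portkey fallback config from an ordered list of virtual keys.

    The first key is the primary target; the rest are tried in order
    when the gateway reports a failure on the previous one.
    """
    if not virtual_keys:
        raise ValueError("at least one virtual key is required")
    return {
        "strategy": {"mode": "fallback"},
        "targets": [{"virtual_key": vk} for vk in virtual_keys],
    }

config = make_fallback_config([
    "openai-virtual-key",
    "anthropic-virtual-key",
    "together-virtual-key",
])
```

Pass the resulting `config` to `createHeaders` exactly as in the example above.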

Load Balancing

Distribute traffic across multiple models:
config = {
    "strategy": {"mode": "loadbalance"},
    "targets": [
        {
            "virtual_key": "openai-key-1",
            "weight": 0.7
        },
        {
            "virtual_key": "openai-key-2",
            "weight": 0.3
        }
    ]
}
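
A common mistake with loadbalance configs is weights that do not sum to 1. A hypothetical helper (not part of the Portkey SDK) can build the config and validate the weights up front:

```python
def make_loadbalance_config(weighted_keys):
    """Build a Portkey loadbalance config from a {virtual_key: weight} mapping.

    Weights should sum to 1.0; a small tolerance absorbs float error.
    """
    total = sum(weighted_keys.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total}")
    return {
        "strategy": {"mode": "loadbalance"},
        "targets": [
            {"virtual_key": vk, "weight": w}
            for vk, w in weighted_keys.items()
        ],
    }

config = make_loadbalance_config({"openai-key-1": 0.7, "openai-key-2": 0.3})
```

As with fallback, pass the resulting `config` to `createHeaders`.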

Caching for Embeddings

Cache embeddings to reduce costs and improve performance:
config = {
    "cache": {
        "mode": "simple",  # or "semantic"
        "max_age": 86400  # 24 hours
    }
}

portkey_headers = createHeaders(
    api_key="your-portkey-api-key",
    provider="openai",
    config=config
)

embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key="your-openai-api-key",
    api_base=PORTKEY_GATEWAY_URL,
    default_headers=portkey_headers
)

Chat Engine with Portkey

Build conversational applications:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI
from portkey_ai import createHeaders, PORTKEY_GATEWAY_URL

# Configure Portkey
portkey_headers = createHeaders(
    api_key="your-portkey-api-key",
    provider="openai",
    metadata={
        "user_id": "user_123",
        "session_id": "session_456"
    }
)

llm = OpenAI(
    model="gpt-4",
    api_key="your-openai-api-key",
    api_base=PORTKEY_GATEWAY_URL,
    default_headers=portkey_headers
)

Settings.llm = llm

# Load documents and create index
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create chat engine with memory
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    memory=memory,
    verbose=True
)

# Chat
response = chat_engine.chat("What is this document about?")
print(response)

response = chat_engine.chat("Can you elaborate on that?")
print(response)

Streaming Responses

Enable streaming for real-time responses:
llm = OpenAI(
    model="gpt-4",
    api_key="your-openai-api-key",
    api_base=PORTKEY_GATEWAY_URL,
    default_headers=portkey_headers
)

response = llm.stream_complete("Tell me a long story")
for chunk in response:
    print(chunk.delta, end="", flush=True)
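
If you also need the full completion text after streaming, collect the deltas as you print them. Below is a sketch with a stand-in list of deltas, since the real `llm.stream_complete(...)` call requires an API key; the real stream yields chunk objects whose `.delta` holds the newly generated text:

```python
def collect_stream(deltas):
    """Print streamed deltas as they arrive and return the full text."""
    parts = []
    for delta in deltas:
        print(delta, end="", flush=True)
        parts.append(delta)
    return "".join(parts)

# Stand-in for (chunk.delta for chunk in llm.stream_complete(...)):
fake_deltas = ["Once ", "upon ", "a ", "time"]
full_text = collect_stream(fake_deltas)
```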

Multi-Document Agents

Build agents that reason over multiple documents:
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from portkey_ai import createHeaders, PORTKEY_GATEWAY_URL

# Configure Portkey
portkey_headers = createHeaders(
    api_key="your-portkey-api-key",
    provider="openai"
)

llm = OpenAI(
    model="gpt-4",
    api_key="your-openai-api-key",
    api_base=PORTKEY_GATEWAY_URL,
    default_headers=portkey_headers
)

# Load multiple document sets
research_docs = SimpleDirectoryReader("./research").load_data()
reports_docs = SimpleDirectoryReader("./reports").load_data()

# Create indices
research_index = VectorStoreIndex.from_documents(research_docs)
reports_index = VectorStoreIndex.from_documents(reports_docs)

# Create query engines
research_engine = research_index.as_query_engine()
reports_engine = reports_index.as_query_engine()

# Create tools
tools = [
    QueryEngineTool(
        query_engine=research_engine,
        metadata=ToolMetadata(
            name="research_papers",
            description="Contains research papers on AI"
        )
    ),
    QueryEngineTool(
        query_engine=reports_engine,
        metadata=ToolMetadata(
            name="company_reports",
            description="Contains company quarterly reports"
        )
    )
]

# Create agent
agent = ReActAgent.from_tools(tools, llm=llm, verbose=True)

# Query across documents
response = agent.chat("Compare the AI trends in research papers vs company reports")
print(response)

Observability and Monitoring

Track your RAG pipeline performance:
from portkey_ai import createHeaders, PORTKEY_GATEWAY_URL

# Add detailed metadata
portkey_headers = createHeaders(
    api_key="your-portkey-api-key",
    provider="openai",
    metadata={
        "environment": "production",
        "user_id": "user_123",
        "query_type": "semantic_search",
        "document_count": 100
    },
    trace_id="rag-pipeline-001"
)

llm = OpenAI(
    model="gpt-4",
    api_key="your-openai-api-key",
    api_base=PORTKEY_GATEWAY_URL,
    default_headers=portkey_headers
)
View detailed metrics in the Portkey dashboard:
  • Query latency
  • Token usage per query
  • Cache hit rates
  • Error rates
  • Cost per query
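
A static `trace_id` like the one above groups every request under one trace. For per-query tracing, generate a fresh id for each request. `request_metadata` is a hypothetical helper (not part of the Portkey SDK) sketching that pattern:

```python
import uuid


def request_metadata(user_id, query_type, environment="production"):
    """Build per-request metadata plus a unique trace id.

    A fresh trace id per query lets you follow a single RAG request
    across the embedding call, retrieval, and the final completion
    in the Portkey dashboard.
    """
    trace_id = f"rag-{uuid.uuid4()}"
    metadata = {
        "environment": environment,
        "user_id": user_id,
        "query_type": query_type,
    }
    return metadata, trace_id

# metadata, trace_id = request_metadata("user_123", "semantic_search")
# portkey_headers = createHeaders(api_key="...", provider="openai",
#                                 metadata=metadata, trace_id=trace_id)
```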

Best Practices

  • Enable caching for embeddings to avoid recomputing them for the same content:
    config = {"cache": {"mode": "simple", "max_age": 86400}}
  • Always configure fallback providers for your RAG pipeline:
    config = {
        "strategy": {"mode": "fallback"},
        "targets": [{"virtual_key": "primary"}, {"virtual_key": "backup"}]
    }
  • Add metadata to track which queries are slow or expensive:
    metadata = {"query_type": "complex_rag", "doc_count": 1000}
  • Monitor token usage to optimize your chunking strategy and reduce costs.
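
To reason about cost per query from the token counts Portkey reports, a quick back-of-the-envelope estimator is enough. The prices below are illustrative placeholders; check your provider's current pricing:

```python
def query_cost(prompt_tokens, completion_tokens, price_in, price_out):
    """Estimate the cost of one query.

    price_in and price_out are the provider's prices per 1K input
    and output tokens respectively (placeholder values in the example).
    """
    return (prompt_tokens / 1000) * price_in + (completion_tokens / 1000) * price_out

# e.g. 1,200 prompt tokens + 300 completion tokens at $0.01 / $0.03 per 1K
cost = query_cost(1200, 300, 0.01, 0.03)
```

Comparing this estimate across chunk sizes shows directly how your chunking strategy drives prompt-token spend.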

Error Handling

Implement robust error handling:
from llama_index.llms.openai import OpenAI
from portkey_ai import createHeaders, PORTKEY_GATEWAY_URL

config = {
    "retry": {
        "attempts": 3,
        "on_status_codes": [429, 500, 502, 503]
    },
    "strategy": {"mode": "fallback"},
    "targets": [
        {"virtual_key": "openai-key"},
        {"virtual_key": "anthropic-key"}
    ]
}

portkey_headers = createHeaders(
    api_key="your-portkey-api-key",
    config=config
)

try:
    llm = OpenAI(
        model="gpt-4",
        api_key="your-openai-api-key",
        api_base=PORTKEY_GATEWAY_URL,
        default_headers=portkey_headers
    )
    response = llm.complete("Your query")
except Exception as e:
    print(f"Error: {e}")
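
Portkey's gateway-side retries and fallbacks absorb most transient provider errors, so client-side handling mainly needs a last line of defense when every target fails. `safe_query` is a hypothetical wrapper sketching that graceful-degradation pattern:

```python
def safe_query(run_query, fallback_message="Sorry, I couldn't answer that right now."):
    """Run a query callable, returning a fallback answer if it raises.

    Use this around the final user-facing call so the application
    degrades gracefully instead of surfacing a stack trace.
    """
    try:
        return run_query()
    except Exception as exc:
        print(f"Query failed: {exc}")
        return fallback_message

# answer = safe_query(lambda: llm.complete("Your query").text)
```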

Resources

Need help? Join our Discord community for support.
