This example demonstrates how to create and instrument a LlamaIndex query engine with OpenInference tracing.

Prerequisites

  • Python 3.9+
  • OpenAI API key
  • Phoenix or another OpenTelemetry collector

Installation

1. Install dependencies

pip install llama-index llama-index-core llama-index-llms-openai \
  openinference-instrumentation-llama-index \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp \
  python-dotenv
2. Set environment variables

export OPENAI_API_KEY="your-api-key"
export COLLECTOR_ENDPOINT="http://localhost:6006/v1/traces"
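
If `COLLECTOR_ENDPOINT` may be unset in some environments, the instrumentation module can fall back to Phoenix's default local endpoint (a minimal sketch; the fallback URL simply mirrors the export above):

```python
import os

# Read the collector endpoint configured above; fall back to Phoenix's
# default local endpoint when the variable is unset.
endpoint = os.getenv("COLLECTOR_ENDPOINT", "http://localhost:6006/v1/traces")
print(endpoint)
```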

Instrumentation Setup

First, create an instrumentation module (save it as instrument.py; the examples below import it by that name):
import os

from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from openinference.semconv.resource import ResourceAttributes
from opentelemetry import trace as trace_api
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import SimpleSpanProcessor


def instrument():
    collector_endpoint = os.getenv("COLLECTOR_ENDPOINT")
    resource = Resource(attributes={ResourceAttributes.PROJECT_NAME: "llama-index-chat"})
    tracer_provider = trace_sdk.TracerProvider(resource=resource)
    span_exporter = OTLPSpanExporter(endpoint=collector_endpoint)
    # SimpleSpanProcessor exports each span synchronously; prefer
    # BatchSpanProcessor for production workloads.
    span_processor = SimpleSpanProcessor(span_exporter=span_exporter)
    tracer_provider.add_span_processor(span_processor=span_processor)
    trace_api.set_tracer_provider(tracer_provider=tracer_provider)
    LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
    print("🔭 OpenInference instrumentation enabled.")

Complete Query Engine Example

import os
from dotenv import load_dotenv
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Load environment and instrument
load_dotenv()
from instrument import instrument
instrument()

# Configure LLM and embeddings
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Create index
index = VectorStoreIndex.from_documents(documents)

# Create query engine
query_engine = index.as_query_engine(similarity_top_k=3)

# Query the engine
response = query_engine.query("What is the main topic of these documents?")
print(response)

Chat Engine Example

LlamaIndex also provides chat engines for conversational applications:
from llama_index.core.memory import ChatMemoryBuffer

# Create chat engine with memory
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)
chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=(
        "You are a helpful assistant with access to a knowledge base. "
        "Always ground your answers in the provided context."
    ),
)

# Have a conversation
response1 = chat_engine.chat("Tell me about the key concepts.")
print(response1)

response2 = chat_engine.chat("Can you elaborate on the first point?")
print(response2)

Streaming Responses

Query engines can also stream tokens as they are generated:
# Create streaming query engine
query_engine = index.as_query_engine(streaming=True)

# Stream the response
streaming_response = query_engine.query("Explain the main ideas.")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)
print()
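
The streaming loop above is plain generator consumption; the same pattern can be exercised without an LLM by substituting a fake token stream (illustrative only — `fake_token_stream` stands in for `streaming_response.response_gen`):

```python
def fake_token_stream():
    # Stand-in for streaming_response.response_gen: yields text chunks
    # in order, just as the streaming query engine would.
    for token in ["The ", "main ", "ideas ", "are..."]:
        yield token

collected = []
for text in fake_token_stream():
    collected.append(text)

full_response = "".join(collected)
print(full_response)  # The main ideas are...
```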

Key Features

Automatic Tracing

LlamaIndex instrumentation captures:
  • Query execution: Full query pipeline from input to output
  • Retrieval: Document retrieval with similarity scores
  • LLM calls: All calls to language models
  • Embeddings: Embedding generation for queries and documents
  • Node processing: Document chunking and indexing
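
As an illustration of what lands on a captured span, here is the rough shape of a retriever span's attributes (keys follow the OpenInference semantic conventions; the values are invented for this sketch):

```python
# Illustrative attributes for a single retriever span. Keys follow the
# OpenInference semantic conventions; values are invented examples.
retriever_span_attributes = {
    "openinference.span.kind": "RETRIEVER",
    "input.value": "What is the main topic of these documents?",
    "retrieval.documents.0.document.score": 0.82,
}
print(retriever_span_attributes["openinference.span.kind"])  # RETRIEVER
```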

Resource Attributes

Use resource attributes to organize projects:
resource = Resource(attributes={
    ResourceAttributes.PROJECT_NAME: "my-app",
})

Memory and Context

The instrumentation tracks:
  • Conversation history in chat engines
  • Context window management
  • Memory buffer operations
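
The token_limit passed to ChatMemoryBuffer earlier bounds how much history is replayed into each prompt. The trimming idea can be sketched with a toy buffer (illustrative only — not ChatMemoryBuffer's actual implementation; tokens are approximated by word count):

```python
class ToyMemoryBuffer:
    """Keep the most recent messages whose combined token count
    fits within token_limit (tokens approximated by word count)."""

    def __init__(self, token_limit):
        self.token_limit = token_limit
        self.messages = []

    def put(self, message):
        self.messages.append(message)

    def get(self):
        # Walk backwards from the newest message, keeping messages
        # until the token budget is exhausted.
        kept, used = [], 0
        for msg in reversed(self.messages):
            tokens = len(msg.split())
            if used + tokens > self.token_limit:
                break
            kept.append(msg)
            used += tokens
        return list(reversed(kept))

memory = ToyMemoryBuffer(token_limit=6)
memory.put("tell me about the key concepts")  # 6 "tokens"
memory.put("elaborate on the first point")    # 5 "tokens"
print(memory.get())  # ['elaborate on the first point']
```

Only the newest message fits the six-token budget, so the older one is dropped from the replayed context — the same trade-off the real buffer manages against the model's context window.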

Production Setup

For production deployments:
import uvicorn
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

from instrument import instrument

# Initialize instrumentation before app creation
instrument()

app = FastAPI(title="LlamaIndex API")

# query_engine and chat_engine are assumed to be created at startup,
# for example as in the sections above.

@app.post("/query")
async def query_endpoint(question: str):
    response = query_engine.query(question)
    return {"answer": str(response)}

@app.post("/chat/stream")
async def chat_stream_endpoint(message: str):
    streaming_response = chat_engine.stream_chat(message)
    
    async def generate():
        for token in streaming_response.response_gen:
            yield token
    
    return StreamingResponse(generate(), media_type="text/plain")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
