
Chat Completions

OpenAI-compatible chat completions endpoint for confidential AI inference. This endpoint is not exposed by the Umbra frontend server. Instead, the frontend connects directly to the provider (vLLM) inside the TEE using authenticated TLS (aTLS) with attestation verification.
This is not a Next.js API route. The frontend uses the confidential-chat.ts library to connect directly to the provider endpoint inside the TEE.

Endpoint

POST {NEXT_PUBLIC_VLLM_BASE_URL}/v1/chat/completions
The base URL is configured via the NEXT_PUBLIC_VLLM_BASE_URL environment variable or provided by the user in the UI.

Authentication

Bearer token authentication is required only if the provider enforces it. The token is passed via the Authorization header.
The aTLS connection is established using the @phala/dcap-qvl-web library, which verifies Intel TDX attestation quotes in-browser.
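A minimal sketch of the conditional Authorization header described above. The helper below is illustrative and not part of confidential-chat.ts; it only shows that the Bearer token is attached when present and omitted otherwise.

```typescript
// Illustrative helper (an assumption, not the library's actual code):
// attach the Authorization header only when the provider requires a token.
function buildHeaders(apiKey?: string): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (apiKey) {
    headers["Authorization"] = `Bearer ${apiKey}`;
  }
  return headers;
}
```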

Security Requirements

  • Must use HTTPS (except for localhost/127.0.0.1 in development)
  • TDX attestation verification via aTLS
  • EKM channel binding to prevent MITM attacks
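The HTTPS requirement above can be sketched as a small URL check: plain HTTP is accepted only for localhost/127.0.0.1 during development. This is an illustrative helper under those stated rules, not the library's actual validation.

```typescript
// Sketch of the documented transport rule (illustrative, not the real check):
// HTTPS is required, except plain HTTP to localhost/127.0.0.1 for development.
function isAllowedBaseUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a valid URL at all
  }
  if (url.protocol === "https:") return true;
  if (url.protocol === "http:") {
    return url.hostname === "localhost" || url.hostname === "127.0.0.1";
  }
  return false;
}
```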

Request Parameters

model
string
required
Model identifier (e.g., “Qwen/Qwen2.5-32B-Instruct”). Configured via NEXT_PUBLIC_VLLM_MODEL or user settings.
messages
array
required
Array of message objects with role (“system”, “user”, or “assistant”) and content (string).
temperature
number
Sampling temperature (0.0 to 2.0). Defaults to 0.7.
max_tokens
number
Maximum tokens to generate. Defaults to 4098.
stream
boolean
Enable streaming responses. Defaults to true.
reasoning_effort
string
Reasoning effort level: “low”, “medium”, or “high”. For models that support reasoning.
cache_salt
string
Cache salt for request deduplication (provider-specific).
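The parameters above can be summarized as a TypeScript shape with the documented defaults. The types mirror the parameter table; the withDefaults helper is an illustrative sketch, not code from confidential-chat.ts.

```typescript
// Request shape derived from the parameter table above.
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

interface ChatCompletionRequest {
  model: string;                              // required
  messages: ChatMessage[];                    // required
  temperature?: number;                       // 0.0 to 2.0, defaults to 0.7
  max_tokens?: number;                        // defaults to 4098
  stream?: boolean;                           // defaults to true
  reasoning_effort?: "low" | "medium" | "high";
  cache_salt?: string;                        // provider-specific
}

// Illustrative helper: fill in the documented defaults for omitted fields.
function withDefaults(req: ChatCompletionRequest): ChatCompletionRequest {
  return { temperature: 0.7, max_tokens: 4098, stream: true, ...req };
}
```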

Request Body

{
  "model": "Qwen/Qwen2.5-32B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is Intel TDX?"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 4096,
  "stream": true
}

Response (Non-Streaming)

id
string
Completion ID
choices
array
required
Array of completion choices. Each choice contains:
  • message: Object with role and content
  • finish_reason: Reason for completion (“stop”, “length”, etc.)

Non-Streaming Response Example

{
  "id": "cmpl-123456",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Intel TDX (Trust Domain Extensions) is a confidential computing technology..."
      },
      "finish_reason": "stop"
    }
  ]
}

Response (Streaming)

When stream: true, the server returns Server-Sent Events (SSE) with data: prefixed lines.

Streaming Format

data: {"choices":[{"delta":{"content":"Intel"}}]}
data: {"choices":[{"delta":{"content":" TDX"}}]}
data: {"choices":[{"delta":{"content":" is"}}]}
data: [DONE]
Each event contains:
  • choices[0].delta.content: Content chunk
  • choices[0].delta.reasoning_content: Reasoning chunk (for models that support it)
  • choices[0].finish_reason: Present in final chunk before [DONE]
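A minimal parser for the streaming format above, assuming the exact SSE shape shown (data: prefixed JSON lines terminated by data: [DONE]). This is a sketch for illustration; the actual stream handling lives in confidential-chat.ts.

```typescript
// Sketch of an SSE line parser for the documented streaming format.
interface StreamDelta {
  content?: string;
  reasoning_content?: string;
}

// Parse one SSE line; returns null for [DONE] and non-data lines.
function parseSseLine(
  line: string
): { delta: StreamDelta; finish_reason?: string } | null {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null;
  const event = JSON.parse(payload);
  const choice = event.choices?.[0] ?? {};
  return { delta: choice.delta ?? {}, finish_reason: choice.finish_reason };
}

// Accumulate content chunks from a sequence of SSE lines.
function collectContent(lines: string[]): string {
  let out = "";
  for (const line of lines) {
    const parsed = parseSseLine(line);
    if (parsed?.delta.content) out += parsed.delta.content;
  }
  return out;
}
```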

Error Responses

400 Bad Request

  • Invalid request body
  • Missing required parameters
  • max_tokens too small or prompt too long

401 Unauthorized

  • Invalid or missing API key

503 Service Unavailable

  • Provider unreachable
  • Connection timeout
  • TLS/certificate error
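The status codes above can be routed to coarse error categories before choosing a user-facing message. This mapping is an illustrative sketch of the documented statuses, not the library's actual error handling.

```typescript
// Illustrative classification of the documented error statuses.
function classifyHttpError(status: number): string {
  switch (status) {
    case 400:
      return "bad_request";          // invalid body, missing params, token limits
    case 401:
      return "unauthorized";         // invalid or missing API key
    case 503:
      return "service_unavailable";  // provider unreachable, timeout, TLS error
    default:
      return "unknown";
  }
}
```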

Example

import { streamConfidentialChat } from "@/lib/confidential-chat";
import { createAtlsFetch } from "@phala/dcap-qvl-web";

// Create aTLS fetch with attestation verification
const atlsFetch = await createAtlsFetch({
  attestationServiceUrl: "https://your-attestation-service.com",
  verifyQuote: true,
});

// Stream chat completions
const stream = streamConfidentialChat(
  {
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "What is Intel TDX?" },
    ],
    model: "Qwen/Qwen2.5-32B-Instruct",
    temperature: 0.7,
    max_tokens: 4096,
    stream: true,
  },
  {
    provider: {
      baseUrl: "https://your-provider.com",
      apiKey: "your-bearer-token",
    },
    fetchImpl: atlsFetch, // use the aTLS fetch so attestation is verified
  }
);

// Process stream
for await (const chunk of stream) {
  if (chunk.type === "delta") {
    console.log(chunk.content);
  } else if (chunk.type === "error") {
    console.error("Error:", chunk.error);
  } else if (chunk.type === "done") {
    console.log("Complete:", chunk.content);
    console.log("Finish reason:", chunk.finish_reason);
  }
}

Message Validation

The confidential-chat.ts library validates all messages:
  • Role must be “system”, “user”, or “assistant”
  • Content must be a non-empty string
  • System message is automatically prepended if not present
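The three rules above can be sketched as a small validation function. The default system prompt string below is an assumed placeholder (in the app it comes from NEXT_PUBLIC_DEFAULT_SYSTEM_PROMPT); the actual implementation in confidential-chat.ts may differ in detail.

```typescript
// Sketch of the documented validation rules (illustrative, not the real code).
const VALID_ROLES = ["system", "user", "assistant"] as const;
type Role = (typeof VALID_ROLES)[number];

interface ChatMessage {
  role: Role;
  content: string;
}

// Assumed placeholder; the real default comes from configuration.
const DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant.";

// Validate every message and prepend a system message if none is present.
function prepareMessages(messages: ChatMessage[]): ChatMessage[] {
  for (const m of messages) {
    if (!VALID_ROLES.includes(m.role)) {
      throw new Error(`invalid role: ${m.role}`);
    }
    if (typeof m.content !== "string" || m.content.length === 0) {
      throw new Error("content must be a non-empty string");
    }
  }
  if (messages[0]?.role !== "system") {
    return [{ role: "system", content: DEFAULT_SYSTEM_PROMPT }, ...messages];
  }
  return messages;
}
```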

Provider Configuration

The frontend supports dynamic provider configuration:
  • Base URL: NEXT_PUBLIC_VLLM_BASE_URL or user-provided
  • Model: NEXT_PUBLIC_VLLM_MODEL or user-provided
  • API Key: Optional Bearer token
  • System Prompt: NEXT_PUBLIC_DEFAULT_SYSTEM_PROMPT or custom
  • Temperature: NEXT_PUBLIC_DEFAULT_TEMPERATURE (default: 0.7)
  • Max Tokens: NEXT_PUBLIC_DEFAULT_MAX_TOKENS (default: 4098)
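The precedence implied by the list above (user-provided values win over NEXT_PUBLIC_* environment defaults) can be sketched as a resolver. The helper and its name are assumptions for illustration; only the variable names and defaults come from the documentation.

```typescript
// Illustrative resolver: user-provided settings override environment defaults.
interface ProviderConfig {
  baseUrl: string;
  model: string;
  apiKey?: string;
  systemPrompt: string;
  temperature: number;
  maxTokens: number;
}

function resolveProviderConfig(
  env: Record<string, string | undefined>,
  user: Partial<ProviderConfig> = {}
): ProviderConfig {
  return {
    baseUrl: user.baseUrl ?? env.NEXT_PUBLIC_VLLM_BASE_URL ?? "",
    model: user.model ?? env.NEXT_PUBLIC_VLLM_MODEL ?? "",
    apiKey: user.apiKey,
    systemPrompt: user.systemPrompt ?? env.NEXT_PUBLIC_DEFAULT_SYSTEM_PROMPT ?? "",
    temperature: user.temperature ?? Number(env.NEXT_PUBLIC_DEFAULT_TEMPERATURE ?? 0.7),
    maxTokens: user.maxTokens ?? Number(env.NEXT_PUBLIC_DEFAULT_MAX_TOKENS ?? 4098),
  };
}
```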

Reasoning Support

Some models support reasoning traces. The library handles:
  • reasoning_content in message objects
  • Streaming reasoning deltas via reasoning_delta chunks
  • Reasoning effort levels (“low”, “medium”, “high”)
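Since reasoning and answer text arrive as separate fields, a consumer typically keeps two buffers. The sketch below routes the delta fields named above into separate content and reasoning strings; it is illustrative, not the library's chunk handling.

```typescript
// Sketch: route streamed deltas into separate content and reasoning buffers.
// Field names follow the streaming documentation above.
interface Delta {
  content?: string;
  reasoning_content?: string;
}

function routeDeltas(deltas: Delta[]): { content: string; reasoning: string } {
  let content = "";
  let reasoning = "";
  for (const d of deltas) {
    if (d.reasoning_content) reasoning += d.reasoning_content;
    if (d.content) content += d.content;
  }
  return { content, reasoning };
}
```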

Implementation Details

Attestation Requirement: The frontend library enforces aTLS connections in production. The fetchImpl parameter must be provided, typically using createAtlsFetch from @phala/dcap-qvl-web.
OpenAI Compatibility: This endpoint follows the OpenAI Chat Completions API specification, making it compatible with standard OpenAI client libraries (though you’ll need a custom aTLS fetch implementation for attestation).

Error Interpretation

The library provides helpful error messages:
  • Max tokens: “This request is larger than the model can process…”
  • Auth failure: “Authorization failed. Check the bearer token…”
  • Network failure: “Cannot connect to the provider. Please check…”
  • CORS: “CORS error: The provider is blocking requests…”
  • TLS/SSL: “TLS/SSL certificate error. Please verify…”
  • Timeout: “Request timed out. The provider may be overloaded…”
Source: frontend/lib/confidential-chat.ts
