
Endpoint

POST /api/chat
Send messages to the AI assistant and receive streaming responses. The AI uses RAG (Retrieval-Augmented Generation) to answer questions based on the PDF document content associated with the chat.
This endpoint runs on Vercel’s Edge Runtime for optimal streaming performance.

Request Body

chatId
number
required
The ID of the chat session to send messages to. Must be a valid chat ID created via /api/create-chat.
messages
array
required
Array of message objects representing the conversation history.

Message Object Structure:
  • role (string): Either "user" or "system"
  • content (string): The message content

Response

Returns a streaming response using Server-Sent Events (SSE). The AI response is streamed token-by-token for real-time display.

Response Type

StreamingTextResponse
The response streams text chunks as they are generated by the AI model.

Response Headers

Content-Type: text/event-stream
Cache-Control: no-cache
Connection: keep-alive

Error Responses

error
string
Error message describing what went wrong

404 Not Found

Returned when the specified chat ID doesn’t exist:
{
  "error": "chat not found"
}

500 Internal Server Error

Returned when an unexpected error occurs:
{
  "error": "internal server error"
}
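Because error bodies are plain JSON rather than a stream, clients can check the status code and parse them directly before attempting to read a stream. A minimal sketch (the helper name is illustrative, not part of the API):

```javascript
// Sketch: turn a non-2xx status and its JSON body into a readable message.
// Only the two documented error statuses (404, 500) are handled here.
function handleErrorBody(status, bodyText) {
  if (status === 404 || status === 500) {
    const { error } = JSON.parse(bodyText);
    return `Request failed (${status}): ${error}`;
  }
  return null; // not an error status handled here
}
```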

How It Works

  1. Context Retrieval: The last user message is used to retrieve relevant context from the PDF via Pinecone vector search
  2. Prompt Construction: A system prompt is created with the retrieved context and AI instructions
  3. Streaming Response: GPT-4 generates a streaming response based on the context
  4. Database Storage: Both user and AI messages are saved to the database
The AI will only answer questions based on the PDF content. If the answer isn’t in the context, it will respond: “I’m sorry, but I don’t know the answer to that question”.
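Step 2 (prompt construction) can be sketched as follows. The template wording and helper name are assumptions for illustration; the endpoint's actual prompt may differ, but it combines the retrieved chunks with an instruction to refuse out-of-context questions:

```javascript
// Sketch of prompt construction: retrieved text chunks are joined and
// injected into a system prompt with the refusal instruction.
function buildSystemPrompt(contextChunks) {
  const context = contextChunks.join('\n');
  return [
    'Answer using only the context below.',
    'If the answer is not in the context, reply:',
    '"I\'m sorry, but I don\'t know the answer to that question".',
    'CONTEXT:',
    context,
  ].join('\n');
}
```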

Example Request

const response = await fetch('/api/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    chatId: 1,
    messages: [
      {
        role: 'user',
        content: 'What is the main topic of this document?'
      }
    ]
  })
});

// Handle streaming response
const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  
  const chunk = decoder.decode(value);
  console.log(chunk); // Process each chunk
}

Example Streaming Response

The response is streamed token-by-token:
The main topic of this document is...
(Text appears progressively as tokens are generated)

AI Model Configuration

  • Model: GPT-4 Turbo (gpt-4-1106-preview)
  • Temperature: Default (not specified, typically 1.0)
  • Streaming: Enabled
  • Context: Dynamically retrieved from PDF via semantic search

Message Persistence

Messages are automatically saved to the database:
  • onStart: User message is saved when streaming begins
  • onCompletion: AI response is saved when streaming completes
For multi-turn conversations, include the full message history in the messages array. The API filters and processes messages appropriately.
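For example, a follow-up turn can be built by appending the new user message to the accumulated history before sending (the `chatId` and helper name here are placeholders):

```javascript
// Sketch: build the body for the next turn by appending the new user
// message to the full conversation history.
function nextRequestBody(history, newUserContent) {
  return {
    chatId: 1, // placeholder chat ID
    messages: [...history, { role: 'user', content: newUserContent }],
  };
}

const history = [
  { role: 'user', content: 'What is the main topic of this document?' },
  { role: 'system', content: 'The main topic is...' },
];
const body = nextRequestBody(history, 'Can you summarize it?');
```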

Context Retrieval

The endpoint uses semantic search to find relevant PDF content:
  1. Last user message is embedded using OpenAI embeddings
  2. Similar vectors are retrieved from Pinecone (top-k results)
  3. Retrieved text chunks are injected into the system prompt
  4. AI generates response based on this context
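Step 2 is performed server-side by Pinecone, but the idea can be illustrated with a small in-memory version: score each stored vector against the query embedding by cosine similarity and keep the top-k matches. The vectors and `k` here are toy placeholders:

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Illustrative top-k retrieval over an in-memory list of { id, vector } docs.
function topK(queryVec, docs, k) {
  return docs
    .map((d) => ({ ...d, score: cosine(queryVec, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```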

Best Practices

  • Include conversation history for context-aware responses
  • Keep individual messages under 4000 tokens for optimal performance
  • Handle streaming responses properly in your client
  • Implement error handling for network issues during streaming
  • Display a loading state while waiting for the first token
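The streaming-related practices above can be combined into one reusable consumer: decode chunks as they arrive, fire a callback on the first chunk (to clear a loading state), and return the accumulated text. The callback names are illustrative, not part of the API:

```javascript
// Sketch: consume a ReadableStream of text, with a hook for the first chunk
// so the UI can hide its loading indicator.
async function consumeTextStream(stream, { onFirstChunk, onChunk } = {}) {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let text = '';
  let first = true;
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    if (first) { onFirstChunk?.(); first = false; }
    const chunk = decoder.decode(value, { stream: true });
    text += chunk;
    onChunk?.(chunk); // e.g. append to the displayed message
  }
  return text;
}
```

In a client, pass `response.body` from the fetch call shown in the Example Request, wrapped in a try/catch to surface network errors mid-stream.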
