What is RAG?

Retrieval-Augmented Generation (RAG) is an AI pattern that enhances language models by providing them with relevant context retrieved from external knowledge sources. Instead of relying solely on the model’s training data, RAG systems:
  1. Retrieve relevant information from a knowledge base
  2. Augment the user’s query with this context
  3. Generate an informed response using the language model
RAG dramatically improves accuracy and reduces hallucinations by grounding AI responses in actual document content.
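The three steps above can be sketched as a toy in-memory loop. Word overlap stands in for real vector similarity here, and `retrieve`/`buildPrompt` are illustrative names, not functions from the codebase:

```typescript
// Toy RAG loop: retrieve → augment → generate. Word-overlap scoring
// stands in for the embedding search used by the real pipeline.
type Chunk = { id: string; text: string };

const knowledgeBase: Chunk[] = [
  { id: "a", text: "Refunds are processed within 5 business days." },
  { id: "b", text: "The API rate limit is 100 requests per minute." },
];

// 1. Retrieve: score chunks by how many words they share with the query.
function retrieve(query: string, chunks: Chunk[], topK = 1): Chunk[] {
  const queryWords = new Set(query.toLowerCase().split(/\W+/));
  return chunks
    .map((c) => ({
      chunk: c,
      score: c.text.toLowerCase().split(/\W+/).filter((w) => queryWords.has(w)).length,
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((s) => s.chunk);
}

// 2. Augment: prepend the retrieved context to the user's question.
function buildPrompt(query: string, context: Chunk[]): string {
  return `CONTEXT:\n${context.map((c) => c.text).join("\n")}\n\nQUESTION: ${query}`;
}

// 3. Generate: this prompt would now be sent to the language model.
const query = "What is the API rate limit?";
const prompt = buildPrompt(query, retrieve(query, knowledgeBase));
```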

Pipeline Architecture

The PDF AI RAG pipeline consists of three main stages:

Stage 1: Document Ingestion

Document ingestion transforms PDFs into searchable vector embeddings.

Stage 2: Context Retrieval

When a user asks a question, relevant document chunks are retrieved.

Stage 3: Response Generation

The retrieved context is combined with the user’s question to generate an answer.

Document Processing

The document processing pipeline is implemented in src/lib/pinecone.ts and handles converting PDFs into searchable vectors.

Step 1: PDF Download from S3

// src/lib/pinecone.ts:31
export async function loadS3IntoPinecone(fileKey: string) {
  // Download the PDF from S3
  console.log("downloading from s3...");
  const file_name = await downloadFromS3(fileKey);
  if (!file_name) {
    throw new Error("unable to download file from s3");
  }
The PDF is first downloaded from AWS S3 to a temporary local file.
The file is stored at D:/pdf-${Date.now()}.pdf during processing (src/lib/s3-server.ts:20).

Step 2: PDF Text Extraction

// src/lib/pinecone.ts:38
const loader = new PDFLoader(file_name);
const pages = (await loader.load()) as PDFPage[];
LangChain’s PDFLoader extracts text from every page, producing an array of page objects:
type PDFPage = {
  pageContent: string;
  metadata: {
    loc: { pageNumber: number };
  };
};

Step 3: Document Chunking

Large documents are split into smaller, semantically meaningful chunks:
// src/lib/pinecone.ts:83
async function prepareDocument(page: PDFPage) {
  let { pageContent, metadata } = page;
  
  // Remove newlines for cleaner text
  pageContent = pageContent.replace(/\n/g, "");
  
  // Split using RecursiveCharacterTextSplitter
  const splitter = new RecursiveCharacterTextSplitter();
  const docs = await splitter.splitDocuments([
    new Document({
      pageContent,
      metadata: {
        pageNumber: metadata.loc.pageNumber,
        text: truncateStringByBytes(pageContent, 36000),
      },
    }),
  ]);
  return docs;
}
Chunking serves three critical purposes:
  1. Token Limits - Language models have context window limits. Smaller chunks ensure we stay within bounds.
  2. Semantic Precision - Smaller chunks provide more focused, relevant context. A chunk about “security” won’t also include unrelated content about “pricing”.
  3. Pinecone Constraints - Metadata is limited to 36KB per vector (hence the truncateStringByBytes call).
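A minimal sketch of what the splitter does under the hood. LangChain's defaults are a 1,000-character chunk size with 200 characters of overlap; the real RecursiveCharacterTextSplitter additionally prefers to break on paragraph and sentence boundaries, which this sketch omits:

```typescript
// Fixed-size character splitter with overlap, approximating
// RecursiveCharacterTextSplitter's defaults (chunkSize 1000, overlap 200).
function splitText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
    start += chunkSize - overlap; // step forward, re-covering the overlap
  }
  return chunks;
}

// 2,500 characters → chunks starting at offsets 0, 800, and 1600.
const chunks = splitText("x".repeat(2500), 1000, 200);
```

The overlap means each chunk repeats the tail of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk.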

Step 4: Byte Truncation

Pinecone has a 36KB metadata limit per vector:
// src/lib/pinecone.ts:74
export const truncateStringByBytes = (str: string, bytes: number) => {
  const enc = new TextEncoder();
  const encodedBytes = enc.encode(str);
  const truncatedBytes = encodedBytes.slice(0, bytes);
  const decoder = new TextDecoder("utf-8");
  const truncatedString = decoder.decode(truncatedBytes);
  return truncatedString;
};
Truncation by bytes (not characters) is crucial because UTF-8 characters can span multiple bytes; a character count would underestimate the payload and could exceed Pinecone's 36KB limit.
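A quick demonstration of the byte boundary, using the same TextEncoder/TextDecoder APIs as the helper above. Note that if the cut lands mid-character, the non-fatal decoder substitutes U+FFFD, so `truncateStringByBytes` can leave a trailing replacement character:

```typescript
const enc = new TextEncoder();
const dec = new TextDecoder("utf-8");

// "é" is one character but two bytes (0xC3 0xA9) in UTF-8.
const bytes = enc.encode("é");

// Slicing after the first byte cuts the character in half; the
// default (non-fatal) decoder emits U+FFFD, the replacement character.
const half = dec.decode(bytes.slice(0, 1));

// Slicing on a character boundary round-trips cleanly.
const whole = dec.decode(bytes.slice(0, 2));
```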

Step 5: Embedding Generation

Each chunk is converted to a 1536-dimension vector using OpenAI’s embedding model:
// src/lib/pinecone.ts:56
async function embedDocument(doc: Document) {
  try {
    const embeddings = await getEmbeddings(doc.pageContent);
    const hash = md5(doc.pageContent);
    return {
      id: hash,
      values: embeddings,
      metadata: {
        text: doc.metadata.text,
        pageNumber: doc.metadata.pageNumber,
      },
    } as PineconeRecord;
  } catch (error) {
    console.log(error);
    throw new Error("unable to embed document");
  }
}
The embedding function uses OpenAI’s text-embedding-ada-002 model:
// src/lib/embeddings.ts:9
export async function getEmbeddings(text: string) {
  try {
    const response = await openai.createEmbedding({
      model: "text-embedding-ada-002",
      input: text.replace(/\n/g, " "),
    });
    const result = await response.json();
    return result.data[0].embedding as number[];
  } catch (error) {
    console.log("error calling openai embeddings api", error);
    throw error;
  }
}
Why use MD5 hashing for IDs? MD5 hashing ensures deterministic IDs: if the same content is processed twice, it gets the same ID, preventing duplicate vectors in Pinecone.
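The idea can be sketched with Node's built-in crypto module (the pipeline imports an `md5` helper; treat this implementation as illustrative):

```typescript
import { createHash } from "crypto";

// Deterministic ID: the same chunk text always hashes to the same hex
// string, so re-ingesting a document upserts over the existing vector
// instead of creating a duplicate.
const md5 = (text: string) => createHash("md5").update(text).digest("hex");

const idA = md5("Refunds are processed within 5 business days.");
const idB = md5("Refunds are processed within 5 business days.");
// idA === idB — re-processing identical content reuses the same ID
```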

Step 6: Pinecone Upload

Vectors are upserted to Pinecone with namespace isolation:
// src/lib/pinecone.ts:44
const vectors = await Promise.all(documents.flat().map(embedDocument));

const client = await getPineconeClient();
const pineconeIndex = await client.Index("aipdf");
const namespace = pineconeIndex.namespace(convertToAscii(fileKey));

console.log("uploading to pinecone...");
await namespace.upsert(vectors);
Each PDF gets its own namespace (based on the S3 file key) to ensure data isolation between documents.
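convertToAscii is not shown above. One plausible sketch, assuming it simply strips non-ASCII characters so the S3 file key is a valid Pinecone namespace name (the real helper lives elsewhere in the codebase and may differ):

```typescript
// Hypothetical sketch: Pinecone namespace names must be ASCII, so any
// non-ASCII characters in the S3 file key are dropped.
function convertToAscii(input: string): string {
  return input.replace(/[^\x00-\x7F]/g, "");
}

const ns = convertToAscii("uploads/résumé-2024.pdf");
// non-ASCII "é" characters are removed from the key
```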

Context Retrieval

When a user asks a question, the system retrieves relevant document chunks through semantic similarity search.

Step 1: Query Embedding

// src/lib/context.ts:46
export async function getContext(query: string, fileKey: string) {
  const queryEmbeddings = await getEmbeddings(query);
  const matches = await getMatchesFromEmbeddings(queryEmbeddings, fileKey);
  // ...
}
The user’s question is converted to the same embedding format as the documents.

Step 2: Similarity Search

// src/lib/context.ts:18
export async function getMatchesFromEmbeddings(
  embeddings: number[],
  fileKey: string
) {
  const pinecone = new PineconeClient();
  await pinecone.init({
    apiKey: process.env.PINECONE_API_KEY!,
    environment: process.env.PINECONE_ENVIRONMENT!,
  });
  const index = await pinecone.Index("aipdf");

  try {
    const namespace = convertToAscii(fileKey);
    const queryResult = await index.query({
      queryRequest: {
        topK: 5,
        vector: embeddings,
        includeMetadata: true,
        namespace,
      },
    });
    return queryResult.matches || [];
  } catch (error) {
    console.log("error querying embeddings", error);
    throw error;
  }
}
topK: 5 means “return the 5 most similar document chunks.”

Pinecone ranks vectors by cosine similarity. The score technically ranges from -1 to 1, but embedding matches in practice fall between 0 and 1:
  • 0.9-1.0 - Extremely similar
  • 0.7-0.9 - Highly relevant
  • 0.5-0.7 - Somewhat relevant
  • < 0.5 - Not relevant
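For intuition, cosine similarity can be computed directly: the dot product of two vectors divided by the product of their magnitudes. Pinecone does this server-side, and text-embedding-ada-002 vectors are unit-normalized, so for them cosine similarity reduces to a plain dot product:

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Shown for intuition only;
// Pinecone computes this over the stored vectors server-side.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

cosineSimilarity([1, 0], [1, 0]); // 1 — identical direction
cosineSimilarity([1, 0], [0, 1]); // 0 — orthogonal (unrelated)
```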

Step 3: Filtering by Relevance

// src/lib/context.ts:50
const qualifyingDocs = matches.filter(
  (match) => match.score && match.score > 0.7
);
Only chunks with similarity scores above 0.7 are used. This threshold prevents irrelevant context from confusing the AI.
If no chunks score above 0.7, the AI will respond with “I don’t know” rather than hallucinating an answer.

Step 4: Context Assembly

// src/lib/context.ts:59
let docs = qualifyingDocs.map((match) => (match.metadata as Metadata).text);
return docs.join("\n").substring(0, 3000);
Retrieved chunks are concatenated and truncated to 3,000 characters. This is far below GPT-4’s context limit; the cap keeps the prompt focused and inexpensive.

Response Generation

The final stage combines retrieved context with the language model to generate answers.

System Prompt Construction

// src/app/api/chat/route.ts:28
const context = await getContext(lastMessage.content, fileKey);

const prompt = {
  role: "system",
  content: `AI assistant is a brand new, powerful, human-like artificial intelligence.
  The traits of AI include expert knowledge, helpfulness, cleverness, and articulateness.
  AI is a well-behaved and well-mannered individual.
  AI is always friendly, kind, and inspiring, and he is eager to provide vivid and thoughtful responses to the user.
  AI has the sum of all knowledge in their brain, and is able to accurately answer nearly any question about any topic in conversation.
  AI assistant is a big fan of Pinecone and Vercel.
  START CONTEXT BLOCK
  ${context}
  END OF CONTEXT BLOCK
  AI assistant will take into account any CONTEXT BLOCK that is provided in a conversation.
  If the context does not provide the answer to the question, the AI assistant will say, "I'm sorry, but I don't know the answer to that question".
  AI assistant will not apologize for previous responses, but instead will indicate that new information was gained.
  AI assistant will not invent anything that is not drawn directly from the context.
  `,
};
The system prompt explicitly instructs the AI to only use information from the context block, preventing hallucinations.

Streaming Chat Completion

// src/app/api/chat/route.ts:46
const response = await openai.createChatCompletion({
  model: "gpt-4-1106-preview",
  messages: [
    prompt,
    ...messages.filter((message: Message) => message.role === "user"),
  ],
  stream: true,
});

const stream = OpenAIStream(response, {
  onStart: async () => {
    await db.insert(dbMessages).values({
      chatId,
      content: lastMessage.content,
      role: "user",
    });
  },
  onCompletion: async (completion) => {
    await db.insert(dbMessages).values({
      chatId,
      content: completion,
      role: "system",
    });
  },
});

return new StreamingTextResponse(stream);
The response is streamed back to the user in real time, providing instant feedback as the answer is generated. Streaming matters for several reasons:
  1. Perceived Performance - Users see responses immediately instead of waiting for complete generation
  2. Better UX - Users can start reading while the AI is still writing
  3. Error Handling - If generation fails mid-stream, users still see partial responses
  4. Lower Time-to-First-Byte - Critical for Edge Runtime cold starts
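The consumption side can be sketched with an async generator standing in for the network stream (the names here are hypothetical, not part of the codebase):

```typescript
// Simulated token stream: an async generator stands in for the SSE
// stream that OpenAIStream wraps in production.
async function* fakeCompletionStream(): AsyncGenerator<string> {
  for (const token of ["The ", "rate ", "limit ", "is ", "100."]) {
    yield token; // in production each chunk arrives over the network
  }
}

// The client appends tokens as they arrive instead of waiting for the
// complete answer — this is the "perceived performance" win.
async function readStream(): Promise<string> {
  let rendered = "";
  for await (const token of fakeCompletionStream()) {
    rendered += token; // the UI would re-render here, token by token
  }
  return rendered;
}
```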

Pipeline Performance

Document Processing Metrics

  • PDF Download: ~1-3 seconds (depends on file size)
  • Text Extraction: ~0.5-2 seconds per page
  • Embedding Generation: ~0.3 seconds per chunk
  • Pinecone Upload: ~1-2 seconds (batch operation)
Total ingestion time for a 10-page PDF: ~15-30 seconds
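A back-of-envelope check of that figure, assuming roughly three chunks per page (an assumption; real chunk counts depend on page density):

```typescript
// Ingestion estimate for a 10-page PDF using the per-stage timings above.
// The 3-chunks-per-page figure is an assumption for illustration.
const pages = 10;
const chunks = pages * 3;

const low  = 1 + pages * 0.5 + chunks * 0.3 + 1; // download + extract + embed + upload
const high = 3 + pages * 2   + chunks * 0.3 + 2;
// low ≈ 16s, high ≈ 34s — consistent with the ~15-30 second figure
```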

Query Performance Metrics

  • Query Embedding: ~0.3 seconds
  • Pinecone Search: ~50-200ms
  • GPT-4 First Token: ~1-2 seconds
  • Streaming Completion: ~3-10 seconds (varies by response length)
Total time to first token: ~1.5-2.5 seconds
All timings assume Edge Runtime deployment with optimal network conditions.

Error Handling

The pipeline includes robust error handling at each stage:
// src/app/api/chat/route.ts:74
try {
  // ... pipeline logic
} catch (error) {
  return NextResponse.json(
    { error: "internal server error" },
    { status: 500 }
  );
}
Common error scenarios:
  1. S3 Download Fails - Returns error before processing
  2. OpenAI API Error - Caught and logged, returns 500
  3. Pinecone Timeout - Automatically retried by SDK
  4. No Context Found - AI responds with “I don’t know”
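For transient failures such as a Pinecone timeout or an OpenAI rate-limit response, a retry-with-backoff wrapper is a common pattern. A hypothetical sketch (the pipeline currently leans on the SDK's built-in retries rather than a helper like this):

```typescript
// Generic retry with exponential backoff for transient failures.
// Hypothetical helper, not part of the codebase.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Exponential backoff: 200ms, 400ms, 800ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}
```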

Optimization Opportunities

Caching

  • Cache query embeddings for common questions
  • Cache Pinecone search results (with TTL)
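A minimal sketch of such a cache, assuming an in-memory map with lazy expiry (hypothetical helper, not part of the codebase; a production deployment would likely use Redis or similar so the cache survives Edge Runtime cold starts):

```typescript
// In-memory TTL cache: repeated questions skip the embeddings round-trip
// until the entry expires. Hypothetical sketch for illustration.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // expired — evict lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}

const embeddingCache = new TtlCache<number[]>(60_000); // 1-minute TTL
```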

Parallel Processing

  • Process PDF pages in parallel
  • Batch embedding API calls
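Batching could be sketched as follows. The OpenAI embeddings endpoint accepts an array of inputs, so each API call can embed many chunks at once instead of one request per chunk (`toBatches` is a hypothetical helper):

```typescript
// Group items into fixed-size batches so each embeddings request
// carries several inputs. Hypothetical helper for illustration.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

const batches = toBatches(["c1", "c2", "c3", "c4", "c5"], 2);
// 5 chunks with batch size 2 → three requests instead of five
```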

Smarter Chunking

  • Use semantic chunking instead of character-based
  • Preserve section boundaries

Hybrid Retrieval

  • Combine vector search with keyword search
  • Re-rank results using cross-encoder models

Context Optimization

  • Dynamically adjust context window based on query complexity
  • Use compression techniques for longer contexts
