
Overview

The PDF upload feature lets users upload PDF documents up to 10MB in size. Once uploaded, PDFs are stored in Amazon S3, converted into vector embeddings, and indexed in Pinecone for intelligent document search and retrieval.

How It Works

1. Select Your PDF
   Drag and drop a PDF file or click to browse. Only .pdf files are accepted, with a maximum size of 10MB.

2. Upload to S3
   Your PDF is securely uploaded to Amazon S3 with a unique file key generated from a timestamp.

3. Process & Embed
   The PDF is downloaded from S3, split into chunks, and converted into vector embeddings stored in Pinecone for semantic search.

4. Create Chat
   A new chat session is created and linked to your PDF, so you can start asking questions immediately.

File Requirements

Supported Format: PDF files only (.pdf)
Maximum Size: 10MB per file
Storage Region: Asia Pacific (Mumbai) - ap-south-1
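These requirements can also be checked before an upload begins. A minimal sketch, assuming a hypothetical helper name (`isValidPdf` is illustrative, not part of the codebase):

```typescript
// Illustrative pre-upload check mirroring the requirements above:
// .pdf extension only, at most 10MB.
const MAX_PDF_BYTES = 10 * 1024 * 1024;

export function isValidPdf(fileName: string, sizeBytes: number): boolean {
  return fileName.toLowerCase().endsWith(".pdf") && sizeBytes <= MAX_PDF_BYTES;
}
```

In the actual component, react-dropzone's `accept` option enforces the extension and the size is checked inside `onDrop`.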

Upload Implementation

The upload component uses react-dropzone for drag-and-drop functionality and the AWS SDK for S3 uploads:
src/components/ui/FileUpload.tsx
const { getRootProps, getInputProps } = useDropzone({
  accept: { "application/pdf": [".pdf"] },
  maxFiles: 1,
  onDrop: async (acceptedFiles) => {
    const file = acceptedFiles[0];
    if (file.size > 10 * 1024 * 1024) {
      toast.error(
        "File size too large. Please upload a file less than 10 MB."
      );
      return;
    }
    try {
      setUploading(true);
      const data = await uploadToS3(file);
      if (!data?.file_key || !data?.file_name) {
        toast.error("Error uploading file");
        return;
      }
      mutate(data, {
        onSuccess: ({ chat_id }) => {
          toast.success("Chat created successfully! Redirecting...");
          router.push(`/chat/${chat_id}`);
        },
      });
    } catch (error) {
      console.error(error);
      toast.error("Something went wrong. Please try again.");
    } finally {
      setUploading(false);
    }
  },
});
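The `mutate` call above comes from a React Query mutation that POSTs the upload result to the create-chat API. A hedged sketch of the request at its core, with the HTTP call injected so it can be exercised without a server (the `post` parameter is illustrative; the real component presumably wires this through `useMutation` with axios):

```typescript
type UploadResult = { file_key: string; file_name: string };
type ChatResponse = { chat_id: number };

// Illustrative core of the create-chat mutation: POST the S3 upload result
// to /api/create-chat and return the new chat id. The HTTP call is injected
// for testability; the real hook likely uses axios directly.
export async function createChat(
  data: UploadResult,
  post: (url: string, body: unknown) => Promise<ChatResponse>
): Promise<ChatResponse> {
  if (!data.file_key || !data.file_name) {
    throw new Error("Missing file_key or file_name");
  }
  return post("/api/create-chat", data);
}
```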

S3 Storage

Uploaded PDFs are stored in Amazon S3 with automatic progress tracking:
src/lib/s3.ts
export async function uploadToS3(file: File) {
  try {
    AWS.config.update({
      accessKeyId: process.env.NEXT_PUBLIC_S3_ACCESS_KEY_ID,
      secretAccessKey: process.env.NEXT_PUBLIC_S3_SECRET_ACCESS_KEY,
    });
    const s3 = new AWS.S3({
      params: {
        Bucket: process.env.NEXT_PUBLIC_S3_BUCKET_NAME,
      },
      region: "ap-south-1",
    });
    const file_key =
      "uploads/" + Date.now().toString() + file.name.replace(/\s+/g, "-");

    const params = {
      Bucket: process.env.NEXT_PUBLIC_S3_BUCKET_NAME!,
      Key: file_key,
      Body: file,
    };

    const upload = s3
      .putObject(params)
      .on("httpUploadProgress", (evt) => {
        const percent = Math.round((evt.loaded * 100) / evt.total);
        console.log("upload progress", percent + "%");
      })
      .promise();

    await upload;
    toast.success("File uploaded successfully");

    return {
      file_key,
      file_name: file.name,
    };
  } catch (error) {
    console.error("upload error", error);
    throw error;
  }
}
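The create-chat route further down stores a `pdfUrl` built by a `getS3Url` helper that is not shown on this page. A plausible reconstruction, assuming a public object URL in the same ap-south-1 bucket (the `bucket` parameter is added here for testability; the real helper presumably reads `NEXT_PUBLIC_S3_BUCKET_NAME` from the environment):

```typescript
// Illustrative sketch of getS3Url: builds the public object URL for a file
// key in the ap-south-1 bucket. Bucket name is a parameter for testability.
export function getS3Url(
  fileKey: string,
  bucket: string = process.env.NEXT_PUBLIC_S3_BUCKET_NAME!
): string {
  return `https://${bucket}.s3.ap-south-1.amazonaws.com/${fileKey}`;
}
```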

Vector Embedding Process

After upload, PDFs are processed and embedded into Pinecone for semantic search:
src/lib/pinecone.ts
export async function loadS3IntoPinecone(fileKey: string) {
  // Download the PDF from S3
  console.log("downloading from s3...");
  const file_name = await downloadFromS3(fileKey);
  if (!file_name) {
    throw new Error("unable to download file from s3");
  }
  const loader = new PDFLoader(file_name);
  const pages = (await loader.load()) as PDFPage[];
  
  // Split the document into smaller segments
  const documents = await Promise.all(pages.map(prepareDocument));

  // Vectorize and embed each document
  const vectors = await Promise.all(documents.flat().map(embedDocument));

  // Upload the vectors to Pinecone
  const client = await getPineconeClient();
  const pineconeIndex = client.Index("aipdf");
  const namespace = pineconeIndex.namespace(convertToAscii(fileKey));
  console.log("uploading to pinecone...");

  await namespace.upsert(vectors);
  return documents[0];
}
Documents are split using RecursiveCharacterTextSplitter from LangChain and truncated to 36,000 bytes to meet Pinecone’s metadata size limits.
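Because Pinecone's limit applies to metadata bytes rather than characters, the truncation has to be byte-wise. A minimal sketch of such a helper (`truncateStringByBytes` is an assumed name, not confirmed by this page):

```typescript
// Truncate a string to at most `bytes` bytes of UTF-8. Slicing in the middle
// of a multi-byte character leaves a U+FFFD replacement character at the end,
// which is acceptable for metadata trimming.
export const truncateStringByBytes = (str: string, bytes: number): string => {
  const encoded = new TextEncoder().encode(str);
  return new TextDecoder("utf-8").decode(encoded.slice(0, bytes));
};
```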

Chat Creation API

Once the PDF is processed, a chat session is created:
src/app/api/create-chat/route.ts
export async function POST(req: NextRequest) {
  const { userId } = getAuth(req);
  if (!userId) {
    return NextResponse.json(
      { error: "Authentication error" },
      { status: 401 }
    );
  }

  try {
    const body = await req.json();
    const { file_key, file_name } = body;
    
    // Load PDF into Pinecone for vector search
    await loadS3IntoPinecone(file_key);
    
    // Create chat record in database
    const chat_id = await db
      .insert(chats)
      .values({
        fileKey: file_key,
        pdfName: file_name,
        pdfUrl: getS3Url(file_key),
        userId: userId,
      })
      .returning({
        insertedId: chats.id,
      });
    return NextResponse.json(
      { chat_id: chat_id[0].insertedId },
      { status: 200 }
    );
  } catch (error) {
    console.error(error);
    return NextResponse.json(
      { error: "internal server error" },
      { status: 500 }
    );
  }
}

Troubleshooting

Upload fails with a file size error: The maximum file size is 10MB. Consider compressing your PDF or splitting it into smaller files. This limit keeps upload times fast and processing performant.

Upload fails with an authentication error: Ensure you’re logged in with a valid Clerk session. The upload process requires authentication to associate the PDF with your user account.

Processing takes a long time: Large PDFs with many pages may take 30-60 seconds to process, because each file is:
  1. Uploaded to S3
  2. Downloaded for processing
  3. Split into chunks
  4. Converted to embeddings
  5. Uploaded to Pinecone vector database

Poor answer accuracy: PDF files with scanned images or complex formatting may have reduced accuracy in text extraction. For best results, use PDFs with selectable text.

Environment Variables

The following environment variables are required for PDF upload. Note that NEXT_PUBLIC_-prefixed variables are inlined into the client bundle, so the S3 credentials below are visible to anyone using the app; for production, consider presigned URLs or a server-side upload route instead.
.env.local
NEXT_PUBLIC_S3_ACCESS_KEY_ID=your_access_key
NEXT_PUBLIC_S3_SECRET_ACCESS_KEY=your_secret_key
NEXT_PUBLIC_S3_BUCKET_NAME=your_bucket_name
PINECONE_API_KEY=your_pinecone_key
PINECONE_ENVIRONMENT=your_pinecone_environment
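A small startup check can surface missing configuration early instead of failing mid-upload. A sketch, using an illustrative helper that is not part of the codebase:

```typescript
// Return the names of required environment variables that are unset or empty.
// `env` defaults to process.env but is a parameter for testability.
export function missingEnvVars(
  required: string[],
  env: Record<string, string | undefined> = process.env
): string[] {
  return required.filter((name) => !env[name]);
}

// Example: the variables listed above.
export const REQUIRED_ENV = [
  "NEXT_PUBLIC_S3_ACCESS_KEY_ID",
  "NEXT_PUBLIC_S3_SECRET_ACCESS_KEY",
  "NEXT_PUBLIC_S3_BUCKET_NAME",
  "PINECONE_API_KEY",
  "PINECONE_ENVIRONMENT",
];
```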
