The ingestion pipeline takes a document buffer, extracts all text and visual content, enriches images and tables with LLM-generated descriptions, and stores the result in Qdrant. The entire process is orchestrated by a single entry point.

Entry point

export const ingestDocument = async (
  fileBuffer: Buffer,
  fileName: string,
  tags: Tags,
): Promise<IngestionResult>
tags scopes every chunk produced from this document so you can later filter searches to a specific institution, course, or study mode.
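A minimal usage sketch of the entry point. The stub body below stands in for the real pipeline so the call shape is runnable on its own; the Mode alias, file name, and tag values are illustrative, not real identifiers from the codebase:

```typescript
// Sketch of calling the entry point. The stub implementation is a
// placeholder — the real ingestDocument runs the full pipeline described
// on this page. The "Mode" alias here is a hypothetical stand-in.
type Mode = string;

interface Tags {
  mode?: Mode;
  institution?: string;
  courseName?: string;
}

type IngestionResult = {
  success: boolean;
  totalChunks: number;
  visualChunks: number;
};

// Stub standing in for the real pipeline entry point.
const ingestDocument = async (
  fileBuffer: Buffer,
  fileName: string,
  tags: Tags,
): Promise<IngestionResult> => {
  return { success: true, totalChunks: 0, visualChunks: 0 };
};

const run = async (): Promise<IngestionResult> => {
  const fileBuffer = Buffer.from("%PDF-"); // in practice: await readFile(path)
  return ingestDocument(fileBuffer, "lecture-notes.pdf", {
    institution: "Example University",
    courseName: "BIO-101",
  });
};
```

Because every chunk inherits the tags, a later search can be scoped to, say, a single course by filtering on courseName.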

Tags type

interface Tags {
  mode?: Mode;
  institution?: string;
  courseName?: string;
}

IngestionResult type

type IngestionResult = {
  success: boolean;
  totalChunks: number;
  visualChunks: number;
};
visualChunks counts the elements that received a visual_description in their metadata — a useful signal for verifying that image extraction and vision analysis ran correctly.

Pipeline stages

1. partitionDocument — text extraction

Calls Unstructured.io to split the document into structured elements. The call runs in parallel with image extraction to reduce overall ingestion time.
export const partitionDocument = async (
  fileBuffer: Buffer,
  fileName: string,
)
Key parameters passed to Unstructured.io:
| Parameter | Value | Effect |
| --- | --- | --- |
| strategy | HiRes | Highest-fidelity layout analysis |
| chunkingStrategy | by_title | Splits on heading boundaries |
| maxCharacters | 1500 | Hard cap per chunk (MAXCHAR) |
| extractImageBlockTypes | ["Image", "Table", "Figure", "Graphic"] | Captures visual elements inline |
| pdfInferTableStructure | true | Reconstructs table HTML |
| splitPdfConcurrencyLevel | 15 | Parallel PDF page processing |
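The table above corresponds to a parameter object along these lines. Field names follow the Unstructured.io REST API's snake_case convention; the SDK wrapper used by the pipeline may expose camelCase equivalents, so treat the exact keys as an assumption:

```typescript
// Sketch of the partition parameters, mirroring the table above.
// snake_case names follow the Unstructured.io HTTP API; the TypeScript
// SDK may use camelCase variants of the same options.
const partitionParameters = {
  strategy: "hi_res",
  chunking_strategy: "by_title",
  max_characters: 1500, // MAXCHAR
  extract_image_block_types: ["Image", "Table", "Figure", "Graphic"],
  pdf_infer_table_structure: true,
  split_pdf_concurrency_level: 15,
};
```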
2. getLocalImages — image extraction

Runs in parallel with partitionDocument. Spawns a pdfplumber Python worker via vision-bridge.ts that scans every page and returns a map of base64-encoded images.
export const getLocalImages = (
  pdfPath: string,
): Promise<Record<number, string[]>>
The return type is a dictionary where each key is a page number and each value is an ordered list of base64 image strings found on that page. The Python process runs inside the project’s venv so it is isolated from the Node.js runtime.
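A sketch of how such a bridge can be wired from Node.js: spawn the Python worker from the venv, buffer its stdout, and parse the page-to-images map. The interpreter path and worker script name below are illustrative, not the project's real file layout:

```typescript
import { spawn } from "node:child_process";

// Sketch of a vision-bridge-style helper. "venv/bin/python" and
// "extract_images.py" are hypothetical paths standing in for the real ones.
const parseImageMap = (stdout: string): Record<number, string[]> => {
  // The worker is assumed to print one JSON object:
  // { "1": ["<base64>", ...], "2": [...], ... }
  const raw = JSON.parse(stdout) as Record<string, string[]>;
  const result: Record<number, string[]> = {};
  for (const [page, images] of Object.entries(raw)) {
    result[Number(page)] = images;
  }
  return result;
};

const getLocalImagesSketch = (pdfPath: string): Promise<Record<number, string[]>> =>
  new Promise((resolve, reject) => {
    const py = spawn("venv/bin/python", ["extract_images.py", pdfPath]);
    let out = "";
    py.stdout.on("data", (chunk) => (out += chunk));
    py.on("error", reject);
    py.on("close", (code) =>
      code === 0
        ? resolve(parseImageMap(out))
        : reject(new Error(`worker exited with code ${code}`)),
    );
  });
```

Keeping the worker in the venv means pdfplumber and its dependencies never leak into the Node.js runtime; the only contract is the JSON printed on stdout.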
3. visionMaker — sync layer

Merges the text elements from Unstructured.io and the images from pdfplumber at the metadata level.
export const visionMaker = (
  raw: PartitionResponse,
  localImages: Record<number, string[]>,
  fileName: string,
): DocumentElement[]
The merge strategy:
  • For each CompositeElement (a chunked text block), if localImages has images for that page, the first available image is attached to the element’s metadata.image_base64 field.
  • Any images that remain after the merge (pages where Unstructured.io produced no text block, or pages with more images than text blocks) are appended as new Image elements with a synthetic element_id.
  • All elements are sorted by page_number before being returned.
This is the “sync layer” described in the architecture overview. Text and images never lose their positional relationship to the original document.
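The three merge rules above can be sketched as follows. The element shape is deliberately simplified; real elements carry the full Unstructured.io metadata:

```typescript
// Self-contained sketch of the merge strategy. SketchElement is a
// simplified stand-in for the pipeline's DocumentElement type.
interface SketchElement {
  type: string;
  element_id: string;
  metadata: { page_number: number; image_base64?: string };
}

const mergeImages = (
  elements: SketchElement[],
  localImages: Record<number, string[]>,
): SketchElement[] => {
  // Track how many of each page's images have been consumed.
  const used: Record<number, number> = {};

  // 1. Attach the first available image on a page to its text block.
  for (const el of elements) {
    const page = el.metadata.page_number;
    const images = localImages[page] ?? [];
    const next = used[page] ?? 0;
    if (el.type === "CompositeElement" && next < images.length) {
      el.metadata.image_base64 = images[next];
      used[page] = next + 1;
    }
  }

  // 2. Append leftover images as new Image elements with synthetic ids.
  const extras: SketchElement[] = [];
  for (const [pageStr, images] of Object.entries(localImages)) {
    const page = Number(pageStr);
    for (let i = used[page] ?? 0; i < images.length; i++) {
      extras.push({
        type: "Image",
        element_id: `synthetic-${page}-${i}`,
        metadata: { page_number: page, image_base64: images[i] },
      });
    }
  }

  // 3. Sort everything by page number before returning.
  return [...elements, ...extras].sort(
    (a, b) => a.metadata.page_number - b.metadata.page_number,
  );
};
```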
4. describeVisualElements — vision analysis

Iterates over every element and, for those with an attached image or HTML table, calls a vision LLM to produce a rich text description. Concurrency is limited to 3 parallel calls (pLimit(3)) to respect API rate limits.
export const describeVisualElements = async (
  elements: DocumentElement[],
): Promise<DocumentElement[]>
Two prompt types are used, selected by element type. For images and figures, the prompt is:
"You are an expert academic illustrator and educator. Analyze this educational diagram in detail.

Instructions:
  1. Identify the Concept: What is the primary scientific or academic topic?
  2. Transcribe All Labels: List every piece of text, label, or annotation visible in the image.
  3. Visual Description: Describe the relationship between the parts.
  4. Educational Context: Explain what a student should learn from this specific visual.

Goal: Produce a keyword-rich text description that allows a search engine to find this image
when a student asks about its specific components or the broader topic."
For HTML tables (those with a text_as_html field), Quark converts the HTML to Markdown using marked instead of calling the vision LLM, since the structured data is already machine-readable.

After analysis, the description is appended to the element’s text field as [Visual Analysis]: <description> and stored in metadata.visual_description (truncated to 500 characters). The raw base64 is cleared from the stored payload.
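The real code caps concurrency with the p-limit package; the inline limiter below is a minimal stand-in showing the same pattern of at most three vision calls in flight, with describeOne as a placeholder for the actual LLM call:

```typescript
// Minimal concurrency limiter standing in for pLimit(3).
const makeLimiter = (max: number) => {
  let active = 0;
  const queue: (() => void)[] = [];
  const next = () => {
    active--;
    queue.shift()?.(); // start the next queued task, if any
  };
  return <T>(task: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        task().then(resolve, reject).finally(next);
      };
      active < max ? run() : queue.push(run);
    });
};

const limit = makeLimiter(3);

// Placeholder for the real vision-LLM call: in the pipeline this sends the
// element's base64 image and returns the generated description.
const describeOne = async (id: number): Promise<string> => `description-${id}`;

const describeAll = (ids: number[]): Promise<string[]> =>
  Promise.all(ids.map((id) => limit(() => describeOne(id))));
```

Promise.all preserves input order, so descriptions line up with their source elements even though calls complete out of order.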
5. processMetadata — embedding and storage

Generates embeddings for all enriched elements and upserts them into Qdrant in batches.
export const processMetadata = async (
  elements: DocumentElement[],
  tags: Tags,
)
Batching behaviour:
  • Elements are processed in batches of 12 (BATCHSIZE = 12).
  • Between batches, the pipeline sleeps for 21 seconds to respect VoyageAI’s rate limits.
  • Each batch is embedded with EmbedRequestInputType.Document and upserted to Qdrant in a single wait: true call.
Payload stored per vector point:
{
  text: string,           // enriched text (including visual analysis)
  page_number: number,
  isVisual: boolean,      // true for Image and Table elements
  imageUrl: string | null,
  institution: string,    // from Tags
  mode: string,           // from Tags
  courseName: string,     // from Tags
  chunkIndex: number,     // position within the document
}
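The batching behaviour can be sketched like this. embedAndUpsert is a placeholder for the real embed-then-upsert call, and the delay is injectable so the sketch runs instantly in tests:

```typescript
// Sketch of the batching loop: 12 elements per batch, with a pause between
// batches for VoyageAI rate limits. embedAndUpsert stands in for the real
// Voyage embedding + Qdrant upsert call.
const BATCHSIZE = 12;

const chunkIntoBatches = <T>(items: T[], size: number): T[][] => {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
};

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

const processInBatches = async <T>(
  items: T[],
  embedAndUpsert: (batch: T[]) => Promise<void>,
  delayMs = 21_000, // 21 s between batches in the real pipeline
): Promise<number> => {
  const batches = chunkIntoBatches(items, BATCHSIZE);
  for (let i = 0; i < batches.length; i++) {
    await embedAndUpsert(batches[i]);
    if (i < batches.length - 1) await sleep(delayMs); // no sleep after the last batch
  }
  return batches.length;
};
```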

Constants reference

| Constant | Value | Effect |
| --- | --- | --- |
| MAXCHAR | 1500 | Maximum characters per text chunk |
| BATCHSIZE | 12 | Chunks embedded and upserted per batch |
| limit | pLimit(3) | Max concurrent vision LLM calls |
| TTL_SECONDS | 1800 | Session TTL in Redis (see memory system) |
The pipeline runs partitionDocument and getLocalImages in parallel using Promise.all. On large PDFs this can cut ingestion time significantly compared to running them sequentially.
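The parallel fan-out looks roughly like this; both stubs stand in for the real partitionDocument and getLocalImages calls:

```typescript
// Sketch of the parallel start of the pipeline. The two stubs are
// placeholders for the real partitionDocument / getLocalImages functions.
const partitionStub = async (fileName: string) => [`element-from-${fileName}`];
const imagesStub = async (pdfPath: string): Promise<Record<number, string[]>> => ({ 1: [] });

const ingestParallel = async (fileName: string, pdfPath: string) => {
  // Both extractions start immediately; total wait is the slower of the
  // two rather than their sum.
  const [raw, localImages] = await Promise.all([
    partitionStub(fileName),
    imagesStub(pdfPath),
  ]);
  return { raw, localImages };
};
```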
