The ingestion pipeline takes a document buffer, extracts all text and visual content, enriches images and tables with LLM-generated descriptions, and stores the result in Qdrant. The entire process is orchestrated by a single entry point.

Entry point

export const ingestDocument = async (
  fileBuffer: Buffer,
  fileName: string,
  tags: Tags,
): Promise<IngestionResult>
tags scopes every chunk produced from this document so you can later filter searches to a specific institution, course, or study mode.
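A minimal usage sketch of the entry point. The stub body below stands in for the real pipeline so the call shape is runnable on its own; the Mode alias, file name, and tag values are illustrative, not real identifiers from the codebase:

```typescript
// Sketch of calling the entry point. The stub implementation is a
// placeholder — the real ingestDocument runs the full pipeline described
// on this page. The "Mode" alias here is a hypothetical stand-in.
type Mode = string;

interface Tags {
  mode?: Mode;
  institution?: string;
  courseName?: string;
}

type IngestionResult = {
  success: boolean;
  totalChunks: number;
  visualChunks: number;
};

// Stub standing in for the real pipeline entry point.
const ingestDocument = async (
  fileBuffer: Buffer,
  fileName: string,
  tags: Tags,
): Promise<IngestionResult> => {
  return { success: true, totalChunks: 0, visualChunks: 0 };
};

const run = async (): Promise<IngestionResult> => {
  const fileBuffer = Buffer.from("%PDF-"); // in practice: await readFile(path)
  return ingestDocument(fileBuffer, "lecture-notes.pdf", {
    institution: "Example University",
    courseName: "BIO-101",
  });
};
```

Because every chunk inherits the tags, a later search can be scoped to, say, a single course by filtering on courseName.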

Tags type

interface Tags {
  mode?: Mode;
  institution?: string;
  courseName?: string;
}

IngestionResult type

type IngestionResult = {
  success: boolean;
  totalChunks: number;
  visualChunks: number;
};
visualChunks counts the elements that received a visual_description in their metadata — a useful signal for verifying that image extraction and vision analysis ran correctly.

Pipeline stages

1. partitionDocument — text extraction

Calls Unstructured.io to split the document into structured elements. The call runs in parallel with image extraction to reduce overall ingestion time.
export const partitionDocument = async (
  fileBuffer: Buffer,
  fileName: string,
)
Key parameters passed to Unstructured.io:
| Parameter | Value | Effect |
| --- | --- | --- |
| strategy | HiRes | Highest-fidelity layout analysis |
| chunkingStrategy | by_title | Splits on heading boundaries |
| maxCharacters | 1500 | Hard cap per chunk (MAXCHAR) |
| extractImageBlockTypes | ["Image", "Table", "Figure", "Graphic"] | Captures visual elements inline |
| pdfInferTableStructure | true | Reconstructs table HTML |
| splitPdfConcurrencyLevel | 15 | Parallel PDF page processing |
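The table above corresponds to a parameter object along these lines. Field names follow the Unstructured.io REST API's snake_case convention; the SDK wrapper used by the pipeline may expose camelCase equivalents, so treat the exact keys as an assumption:

```typescript
// Sketch of the partition parameters, mirroring the table above.
// snake_case names follow the Unstructured.io HTTP API; the TypeScript
// SDK may use camelCase variants of the same options.
const partitionParameters = {
  strategy: "hi_res",
  chunking_strategy: "by_title",
  max_characters: 1500, // MAXCHAR
  extract_image_block_types: ["Image", "Table", "Figure", "Graphic"],
  pdf_infer_table_structure: true,
  split_pdf_concurrency_level: 15,
};
```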
2. getLocalImages — image extraction

Runs in parallel with partitionDocument. Spawns a pdfplumber Python worker via vision-bridge.ts that scans every page and returns a map of base64-encoded images.
export const getLocalImages = (
  pdfPath: string,
): Promise<Record<number, string[]>>
The return type is a dictionary where each key is a page number and each value is an ordered list of base64 image strings found on that page. The Python process runs inside the project’s venv so it is isolated from the Node.js runtime.
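A sketch of how such a bridge can be wired from Node.js: spawn the Python worker from the venv, buffer its stdout, and parse the page-to-images map. The interpreter path and worker script name below are illustrative, not the project's real file layout:

```typescript
import { spawn } from "node:child_process";

// Sketch of a vision-bridge-style helper. "venv/bin/python" and
// "extract_images.py" are hypothetical paths standing in for the real ones.
const parseImageMap = (stdout: string): Record<number, string[]> => {
  // The worker is assumed to print one JSON object:
  // { "1": ["<base64>", ...], "2": [...], ... }
  const raw = JSON.parse(stdout) as Record<string, string[]>;
  const result: Record<number, string[]> = {};
  for (const [page, images] of Object.entries(raw)) {
    result[Number(page)] = images;
  }
  return result;
};

const getLocalImagesSketch = (pdfPath: string): Promise<Record<number, string[]>> =>
  new Promise((resolve, reject) => {
    const py = spawn("venv/bin/python", ["extract_images.py", pdfPath]);
    let out = "";
    py.stdout.on("data", (chunk) => (out += chunk));
    py.on("error", reject);
    py.on("close", (code) =>
      code === 0
        ? resolve(parseImageMap(out))
        : reject(new Error(`worker exited with code ${code}`)),
    );
  });
```

Keeping the worker in the venv means pdfplumber and its dependencies never leak into the Node.js runtime; the only contract is the JSON printed on stdout.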
3. visionMaker — sync layer

Merges the text elements from Unstructured.io and the images from pdfplumber at the metadata level.
export const visionMaker = (
  raw: PartitionResponse,
  localImages: Record<number, string[]>,
  fileName: string,
): DocumentElement[]
The merge strategy:
  • For each CompositeElement (a chunked text block), if localImages has images for that page, the first available image is attached to the element’s metadata.image_base64 field.
  • Any images that remain after the merge (pages where Unstructured.io produced no text block, or pages with more images than text blocks) are appended as new Image elements with a synthetic element_id.
  • All elements are sorted by page_number before being returned.
This is the “sync layer” described in the architecture overview. Text and images never lose their positional relationship to the original document.
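The three merge rules above can be sketched as follows. The element shape is deliberately simplified; real elements carry the full Unstructured.io metadata:

```typescript
// Self-contained sketch of the merge strategy. SketchElement is a
// simplified stand-in for the pipeline's DocumentElement type.
interface SketchElement {
  type: string;
  element_id: string;
  metadata: { page_number: number; image_base64?: string };
}

const mergeImages = (
  elements: SketchElement[],
  localImages: Record<number, string[]>,
): SketchElement[] => {
  // Track how many of each page's images have been consumed.
  const used: Record<number, number> = {};

  // 1. Attach the first available image on a page to its text block.
  for (const el of elements) {
    const page = el.metadata.page_number;
    const images = localImages[page] ?? [];
    const next = used[page] ?? 0;
    if (el.type === "CompositeElement" && next < images.length) {
      el.metadata.image_base64 = images[next];
      used[page] = next + 1;
    }
  }

  // 2. Append leftover images as new Image elements with synthetic ids.
  const extras: SketchElement[] = [];
  for (const [pageStr, images] of Object.entries(localImages)) {
    const page = Number(pageStr);
    for (let i = used[page] ?? 0; i < images.length; i++) {
      extras.push({
        type: "Image",
        element_id: `synthetic-${page}-${i}`,
        metadata: { page_number: page, image_base64: images[i] },
      });
    }
  }

  // 3. Sort everything by page number before returning.
  return [...elements, ...extras].sort(
    (a, b) => a.metadata.page_number - b.metadata.page_number,
  );
};
```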
4. describeVisualElements — vision analysis

Iterates over every element and, for those with an attached image or HTML table, calls a vision LLM to produce a rich text description. Concurrency is limited to 3 parallel calls (pLimit(3)) to respect API rate limits.
export const describeVisualElements = async (
  elements: DocumentElement[],
): Promise<DocumentElement[]>
Two prompt types are used, selected by element type. For images and figures, the prompt is:
"You are an expert academic illustrator and educator. Analyze this educational diagram in detail.

Instructions:
  1. Identify the Concept: What is the primary scientific or academic topic?
  2. Transcribe All Labels: List every piece of text, label, or annotation visible in the image.
  3. Visual Description: Describe the relationship between the parts.
  4. Educational Context: Explain what a student should learn from this specific visual.

Goal: Produce a keyword-rich text description that allows a search engine to find this image
when a student asks about its specific components or the broader topic."
For HTML tables (those with a text_as_html field), Quark converts the HTML to Markdown using marked instead of calling the vision LLM, since the structured data is already machine-readable.

After analysis, the description is appended to the element’s text field as [Visual Analysis]: <description> and stored in metadata.visual_description (truncated to 500 characters). The raw base64 is cleared from the stored payload.
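The real code caps concurrency with the p-limit package; the inline limiter below is a minimal stand-in showing the same pattern of at most three vision calls in flight, with describeOne as a placeholder for the actual LLM call:

```typescript
// Minimal concurrency limiter standing in for pLimit(3).
const makeLimiter = (max: number) => {
  let active = 0;
  const queue: (() => void)[] = [];
  const next = () => {
    active--;
    queue.shift()?.(); // start the next queued task, if any
  };
  return <T>(task: () => Promise<T>): Promise<T> =>
    new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        task().then(resolve, reject).finally(next);
      };
      active < max ? run() : queue.push(run);
    });
};

const limit = makeLimiter(3);

// Placeholder for the real vision-LLM call: in the pipeline this sends the
// element's base64 image and returns the generated description.
const describeOne = async (id: number): Promise<string> => `description-${id}`;

const describeAll = (ids: number[]): Promise<string[]> =>
  Promise.all(ids.map((id) => limit(() => describeOne(id))));
```

Promise.all preserves input order, so descriptions line up with their source elements even though calls complete out of order.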
5. processMetadata — embedding and storage

Generates embeddings for all enriched elements and upserts them into Qdrant in batches.
export const processMetadata = async (
  elements: DocumentElement[],
  tags: Tags,
)
Batching behaviour:
  • Elements are processed in batches of 12 (BATCHSIZE = 12).
  • Between batches, the pipeline sleeps for 21 seconds to respect VoyageAI’s rate limits.
  • Each batch is embedded with EmbedRequestInputType.Document and upserted to Qdrant in a single wait: true call.
Payload stored per vector point:
{
  text: string,           // enriched text (including visual analysis)
  page_number: number,
  isVisual: boolean,      // true for Image and Table elements
  imageUrl: string | null,
  institution: string,    // from Tags
  mode: string,           // from Tags
  courseName: string,     // from Tags
  chunkIndex: number,     // position within the document
}
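The batching behaviour can be sketched like this. embedAndUpsert is a placeholder for the real embed-then-upsert call, and the delay is injectable so the sketch runs instantly in tests:

```typescript
// Sketch of the batching loop: 12 elements per batch, with a pause between
// batches for VoyageAI rate limits. embedAndUpsert stands in for the real
// Voyage embedding + Qdrant upsert call.
const BATCHSIZE = 12;

const chunkIntoBatches = <T>(items: T[], size: number): T[][] => {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
};

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

const processInBatches = async <T>(
  items: T[],
  embedAndUpsert: (batch: T[]) => Promise<void>,
  delayMs = 21_000, // 21 s between batches in the real pipeline
): Promise<number> => {
  const batches = chunkIntoBatches(items, BATCHSIZE);
  for (let i = 0; i < batches.length; i++) {
    await embedAndUpsert(batches[i]);
    if (i < batches.length - 1) await sleep(delayMs); // no sleep after the last batch
  }
  return batches.length;
};
```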

Constants reference

| Constant | Value | Effect |
| --- | --- | --- |
| MAXCHAR | 1500 | Maximum characters per text chunk |
| BATCHSIZE | 12 | Chunks embedded and upserted per batch |
| limit | pLimit(3) | Max concurrent vision LLM calls |
| TTL_SECONDS | 1800 | Session TTL in Redis (see memory system) |
The pipeline runs partitionDocument and getLocalImages in parallel using Promise.all. On large PDFs this can cut ingestion time significantly compared to running them sequentially.
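The parallel fan-out looks roughly like this; both stubs stand in for the real partitionDocument and getLocalImages calls:

```typescript
// Sketch of the parallel start of the pipeline. The two stubs are
// placeholders for the real partitionDocument / getLocalImages functions.
const partitionStub = async (fileName: string) => [`element-from-${fileName}`];
const imagesStub = async (pdfPath: string): Promise<Record<number, string[]>> => ({ 1: [] });

const ingestParallel = async (fileName: string, pdfPath: string) => {
  // Both extractions start immediately; total wait is the slower of the
  // two rather than their sum.
  const [raw, localImages] = await Promise.all([
    partitionStub(fileName),
    imagesStub(pdfPath),
  ]);
  return { raw, localImages };
};
```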
