Document ingestion

Ingestion runs a document through the full pipeline: Unstructured.io partitions the text, pdfplumber extracts images, VoyageAI embeds each chunk, and the resulting vectors are upserted into Qdrant.

Ingestion requires a valid Unstructured.io API key and URL, both configured in your .env file. The pipeline exits with an error if they are not set. See Configuration for details.

CLI ingestion

Use the :ingest command inside the running CLI:

:ingest <path> [institution] [course]

Run the ingest command

Provide a path to your PDF. Wrap paths that contain spaces in double quotes:

# Simple path
:ingest /home/user/notes.pdf

# Path with spaces
:ingest "/home/user/lecture notes/week3.pdf" MIT biology

# Full metadata
:ingest /home/user/chem101.pdf Stanford chemistry

Wait for ingestion to complete

A progress indicator is shown while the document is processed. When ingestion finishes, Quark prints the chunk counts:

✓  notes.pdf — 42 chunks  3 visual

Verify with :docs

Run :docs to confirm the document appears in the active session’s ingest log.

Arguments

Argument	Required	Default	Description
`path`	Yes	—	Absolute or relative path to the PDF. Quote if the path contains spaces.
`institution`	No	`"Default"`	Tag used to scope vector search. Persists for the rest of the session.
`course`	No	(none)	Free-text label for the course or topic.
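The argument rules above can be sketched in a few lines of Python. This is an illustration only: how Quark actually parses the command is not documented here, and `parse_ingest` and `IngestArgs` are hypothetical names. The sketch assumes shell-style quoting, which `shlex.split` provides, and applies the documented defaults.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass
class IngestArgs:
    """Hypothetical container for the three documented arguments."""
    path: str
    institution: str = "Default"   # documented default
    course: Optional[str] = None   # documented as having no default


def parse_ingest(line: str) -> IngestArgs:
    """Split an ':ingest ...' line into arguments (illustrative, not Quark's parser)."""
    tokens = shlex.split(line)  # keeps a double-quoted path with spaces as one token
    if not tokens or tokens[0] != ":ingest":
        raise ValueError("not an :ingest command")
    args = tokens[1:]
    if not 1 <= len(args) <= 3:
        raise ValueError("usage: :ingest <path> [institution] [course]")
    return IngestArgs(*args)


if __name__ == "__main__":
    print(parse_ingest(':ingest "/home/user/lecture notes/week3.pdf" MIT biology'))
```

Without the double quotes, the path in the example would split at the space and the remaining tokens would exceed the three-argument limit.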

API ingestion

The API ingestion flow is two steps: first obtain a presigned S3 upload URL, then trigger processing.

Get a presigned upload URL

POST /api/v1/ingest/upload/url

Quark returns { signedUrl, key }. Upload the file directly to signedUrl using a PUT request with the raw file bytes in the request body. Allowed types: application/pdf, image/jpeg, image/jpg. Maximum size: 50 MB.
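A client-side sketch of the upload step, using only the Python standard library. `build_upload_request` is a hypothetical helper, not part of Quark. Two things here are assumptions: that "50 MB" means 50 × 1024 × 1024 bytes, and that sending a `Content-Type` header on the PUT is appropriate (presigned URLs often require it to match the type that was signed).

```python
import urllib.request

# Allowed types as documented above.
ALLOWED_TYPES = {"application/pdf", "image/jpeg", "image/jpg"}
MAX_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" interpreted as 50 MiB


def build_upload_request(signed_url: str, data: bytes, content_type: str) -> urllib.request.Request:
    """Validate the file locally, then build a PUT request carrying the raw bytes."""
    if content_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported content type: {content_type}")
    if len(data) > MAX_BYTES:
        raise ValueError("file exceeds the 50 MB limit")
    return urllib.request.Request(
        signed_url,
        data=data,       # raw file bytes as the request body
        method="PUT",
        headers={"Content-Type": content_type},  # assumption: header is expected
    )
```

To send the file, pass the returned request to `urllib.request.urlopen`. Checking type and size before uploading avoids a round trip for files the server would reject anyway.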

Trigger processing

Once the file is uploaded, send the file key and optional tags to the processing endpoint:

POST /api/v1/ingest/process

Request body

Field	Type	Required	Description
`key`	string	Yes	The object key returned by the upload URL step (uploadData.key).
`filename`	string	Yes	The original filename of the document (e.g. report.pdf).
`session_id`	string	Yes	The session to associate the ingested document with.

Response

Field	Type	Description
`totalChunks`	number	Number of text chunks extracted and embedded.
`visualChunks`	number	Number of image chunks extracted and embedded.
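The request and response fields above could be handled on the client roughly as follows. This is a sketch under assumptions: the body is assumed to be JSON-encoded, the response is assumed to be a flat JSON object, and `build_process_body`, `parse_process_response`, and `IngestResult` are hypothetical names. Only the field names come from this page.

```python
import json
from dataclasses import dataclass


@dataclass
class IngestResult:
    total_chunks: int
    visual_chunks: int


def build_process_body(key: str, filename: str, session_id: str) -> bytes:
    """Serialize the three required fields (assumes the endpoint accepts JSON)."""
    fields = {"key": key, "filename": filename, "session_id": session_id}
    for name, value in fields.items():
        if not value:
            raise ValueError(f"{name} is required")
    return json.dumps(fields).encode("utf-8")


def parse_process_response(raw: bytes) -> IngestResult:
    """Read the documented response fields (assumes a flat JSON object)."""
    payload = json.loads(raw)
    return IngestResult(
        total_chunks=int(payload["totalChunks"]),
        visual_chunks=int(payload["visualChunks"]),
    )
```

The body would be sent as a POST to /api/v1/ingest/process. The host, headers, and any authentication are not specified on this page, so they are left out of the sketch.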

Ingest multiple documents tagged with the same institution value, then use that value when querying to scope answers to a specific document set.
