The Arcana.Ingest module handles document ingestion for Arcana’s RAG pipeline. It processes text or files by chunking content, generating embeddings, and optionally extracting entities and relationships for GraphRAG.
Overview
Ingestion transforms raw content into searchable knowledge:
- Parse - Extract text from files (PDF, Markdown, plain text)
- Chunk - Split content into semantic segments
- Embed - Generate vector embeddings for each chunk
- Store - Persist chunks with embeddings in your database
- Graph (optional) - Extract entities and relationships for enhanced retrieval
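To make the Chunk step more concrete, here is a minimal sketch of a character-based sliding window with overlap. It is illustrative only: `ChunkSketch` is a hypothetical module, not Arcana's chunker, and the real chunker may split on semantic boundaries rather than fixed offsets.

```elixir
defmodule ChunkSketch do
  # Splits `text` into chunks of at most `size` characters. Each chunk starts
  # `size - overlap` characters after the previous one, so consecutive chunks
  # share `overlap` characters.
  def chunk(text, size, overlap)
      when is_binary(text) and is_integer(size) and is_integer(overlap) and
             size > 0 and overlap >= 0 and overlap < size do
    do_chunk(String.graphemes(text), size, size - overlap, [])
  end

  defp do_chunk([], _size, _step, acc), do: Enum.reverse(acc)

  defp do_chunk(graphemes, size, step, acc) do
    {window, rest} = Enum.split(graphemes, size)
    acc = [Enum.join(window) | acc]

    case rest do
      # The window reached the end of the text, so this was the last chunk.
      [] -> Enum.reverse(acc)
      _ -> do_chunk(Enum.drop(graphemes, step), size, step, acc)
    end
  end
end
```

For example, `ChunkSketch.chunk("abcdefghij", 4, 1)` returns `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the previous one.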
Functions
ingest/2
Ingests text content, creating a document with embedded chunks.
The text content to ingest. Can be of any length; it will be automatically chunked.
Ingestion options:
The Ecto repo module to use for database operations. Required unless configured globally via config :arcana, repo: MyApp.Repo.
Optional identifier for grouping or filtering documents. Useful for tracking document sources (e.g., “user_123”, “website_page”).
Optional metadata map to store with the document. Can contain any key-value pairs for filtering or display purposes.
Maximum chunk size. The unit depends on the size_unit option:
- :characters - size in characters (default)
- :tokens - size in tokens
Number of characters or tokens to overlap between consecutive chunks. Helps maintain context across chunk boundaries.
Unit for chunk_size and chunk_overlap: :characters or :tokens.
Collection to organize the document. Can be:
- A string: "my_collection"
- A map: %{name: "my_collection", description: "Documentation for X"}
Enable or disable GraphRAG entity/relationship extraction for this document. Overrides the global configuration setting.
Format hint for the chunker (e.g., :markdown, :code). Some chunkers may use this to improve chunking quality.
Returns the created document struct containing:
- id (binary_id) - Unique identifier for the document
- content (string) - The original text content
- source_id (string | nil) - Source identifier if provided
- metadata (map) - Document metadata
- status (atom) - Processing status: :completed, :failed, etc.
- chunk_count (integer) - Number of chunks created from the document
- collection_id (binary_id) - ID of the associated collection
- content_type (string) - MIME type (defaults to “text/plain”)
- inserted_at (DateTime) - Creation timestamp
- updated_at (DateTime) - Last update timestamp
Returns an error tuple if ingestion fails. Common errors:
- {:error, {:embedding_failed, reason}} - Failed to generate embeddings
- {:error, :invalid_text} - Text is nil or empty
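A call to ingest/2 might look like the sketch below. The option names `:repo`, `:chunk_size`, `:chunk_overlap`, and `:size_unit` appear in this reference. The names `:source_id`, `:metadata`, and `:collection`, the `{:ok, document}` success shape, and the `IngestSketch` module itself are assumptions made for illustration, so check them against your Arcana version.

```elixir
defmodule IngestSketch do
  # :source_id, :metadata and :collection are assumed option names; the
  # values are placeholders.
  def default_opts do
    [
      repo: MyApp.Repo,
      source_id: "user_123",
      metadata: %{"lang" => "en"},
      chunk_size: 512,
      chunk_overlap: 64,
      size_unit: :tokens,
      collection: "my_collection"
    ]
  end

  # `ingest_fun` defaults to Arcana.ingest/2. It is a parameter only so the
  # sketch can be exercised without a database or embedding model.
  def ingest_note(text, opts \\ default_opts(), ingest_fun \\ &Arcana.ingest/2) do
    case ingest_fun.(text, opts) do
      # Assumes success is reported as {:ok, document}.
      {:ok, document} -> {:ok, document.id, document.chunk_count}
      {:error, reason} -> {:error, reason}
    end
  end
end
```

With Arcana configured, `IngestSketch.ingest_note("Some text to index")` would run the real pipeline; passing a stub as the third argument keeps the example testable in isolation.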
ingest_file/2
Ingests a file by parsing its content and creating a document with embedded chunks.
Absolute or relative path to the file to ingest. Supported formats:
- Plain text (.txt)
- Markdown (.md, .markdown)
- PDF (.pdf)
- Other formats via custom parsers
Same options as ingest/2:
The Ecto repo to use
Optional source identifier
Optional metadata map
Maximum chunk size
Overlap between chunks
Collection name or map
Enable/disable GraphRAG extraction
Returns the created document with additional fields:
- file_path (string) - Path to the ingested file
- content_type (string) - Detected MIME type (e.g., “text/markdown”, “application/pdf”)
Returns an error tuple if file processing fails:
- {:error, :file_not_found} - File doesn’t exist
- {:error, :parse_failed} - Failed to parse file
- {:error, {:embedding_failed, reason}} - Embedding generation failed
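When ingesting several files, it can help to collect successes and failures separately instead of stopping at the first error. The sketch below assumes ingest_file/2 returns `{:ok, document}` on success; `IngestFilesSketch` and `ingest_all/3` are hypothetical names, not part of Arcana.

```elixir
defmodule IngestFilesSketch do
  # Ingests each path and splits the results into successes and failures.
  # `ingest_fun` defaults to Arcana.ingest_file/2; it is a parameter only so
  # the sketch can be tried with a stub.
  def ingest_all(paths, opts, ingest_fun \\ &Arcana.ingest_file/2) do
    results = Enum.map(paths, fn path -> {path, ingest_fun.(path, opts)} end)

    {oks, errors} =
      Enum.split_with(results, fn {_path, result} -> match?({:ok, _}, result) end)

    %{
      ingested: Enum.map(oks, fn {path, {:ok, doc}} -> {path, doc} end),
      failed: Enum.map(errors, fn {path, {:error, reason}} -> {path, reason} end)
    }
  end
end
```

A call such as `IngestFilesSketch.ingest_all(["guide.md", "spec.pdf"], repo: MyApp.Repo)` would then return a map whose `:failed` list pairs each problematic path with its error reason.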
Chunking Strategies
Arcana supports different chunking strategies via the configured chunker.
Character-based Chunking
Split text by character count with overlap.
Token-based Chunking
Split text by token count for better LLM compatibility.
Format-aware Chunking
Provide format hints for better chunking.
Collections
Collections organize documents in your knowledge base.
GraphRAG Integration
When GraphRAG is enabled, Arcana extracts entities and relationships during ingestion.
Telemetry Events
The ingest module emits telemetry events for monitoring:
- [:arcana, :ingest, :start] - Ingestion started
- [:arcana, :ingest, :stop] - Ingestion completed
- [:arcana, :ingest, :exception] - Ingestion failed
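The events listed above can be consumed with the `:telemetry` package's `attach_many/4`. The sketch below is a hypothetical handler module; it assumes the stop event carries a `:duration` measurement in native time units (the convention used by `:telemetry.span/3`), which this reference does not confirm, and falls back to a generic message otherwise.

```elixir
defmodule IngestTelemetrySketch do
  @events [
    [:arcana, :ingest, :start],
    [:arcana, :ingest, :stop],
    [:arcana, :ingest, :exception]
  ]

  # Call once at application start. Requires the :telemetry package.
  def attach do
    :telemetry.attach_many("ingest-logger-sketch", @events, &__MODULE__.handle_event/4, nil)
  end

  # Telemetry handlers receive the event name, measurements, metadata and
  # the config passed to attach_many/4.
  def handle_event(event, measurements, metadata, _config) do
    IO.puts(format_event(event, measurements, metadata))
  end

  # Assumed: the stop event reports :duration in native time units.
  def format_event([:arcana, :ingest, :stop], %{duration: duration}, _metadata) do
    ms = System.convert_time_unit(duration, :native, :millisecond)
    "ingest finished in #{ms} ms"
  end

  # Covers :start, :exception, and a :stop event without a duration.
  def format_event([:arcana, :ingest, stage], _measurements, _metadata) do
    "ingest #{stage}"
  end
end
```

Keeping the formatting in a pure function separate from the handler makes it easy to unit-test without attaching to a live telemetry pipeline.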
Error Handling
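One way to handle the error tuples documented for ingest/2 and ingest_file/2 is to map each result to an action. The sketch below is an assumption-laden illustration: `IngestErrorsSketch` is a hypothetical module, the `{:ok, document}` success shape is assumed, and treating embedding failures as retryable is a design choice rather than Arcana's documented behavior.

```elixir
defmodule IngestErrorsSketch do
  # Maps an ingest result to an action the caller can take.
  def classify({:ok, document}), do: {:done, document}
  def classify({:error, :invalid_text}), do: {:skip, "text is nil or empty"}
  def classify({:error, :file_not_found}), do: {:skip, "file does not exist"}
  def classify({:error, :parse_failed}), do: {:skip, "file could not be parsed"}

  # Embedding failures are often transient (rate limits, network), so this
  # sketch marks them as retryable.
  def classify({:error, {:embedding_failed, reason}}), do: {:retry, reason}

  # Any error shape not listed in this reference.
  def classify({:error, other}), do: {:fail, other}
end
```

A caller could pipe a result straight into it, for example `text |> Arcana.ingest(repo: MyApp.Repo) |> IngestErrorsSketch.classify()`, and branch on `:done`, `:skip`, `:retry`, or `:fail`.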
Related Functions
- Arcana.ingest/2 - Main module function
- Arcana.ingest_file/2 - Main module function
- Arcana.search/2 - Search ingested content