The Arcana.Ingest module handles document ingestion for Arcana’s RAG pipeline. It processes text or files by chunking content, generating embeddings, and optionally extracting entities and relationships for GraphRAG.

Overview

Ingestion transforms raw content into searchable knowledge:
  1. Parse - Extract text from files (PDF, Markdown, plain text)
  2. Chunk - Split content into semantic segments
  3. Embed - Generate vector embeddings for each chunk
  4. Store - Persist chunks with embeddings in your database
  5. Graph (optional) - Extract entities and relationships for enhanced retrieval

Functions

ingest/2

Ingests text content, creating a document with embedded chunks.
ingest(text, opts) :: {:ok, %Arcana.Document{}} | {:error, term()}
text (string, required)
The text content to ingest. Can be of any length; it will be chunked automatically.

opts (keyword, required)
Ingestion options:

repo (module, required)
The Ecto repo module to use for database operations. Required unless configured globally via config :arcana, repo: MyApp.Repo.

source_id (string, optional)
Identifier for grouping or filtering documents. Useful for tracking document sources (e.g., "user_123", "website_page").

metadata (map, optional)
Metadata map to store with the document. Can contain any key-value pairs for filtering or display purposes.

chunk_size (integer, default: 1024)
Maximum chunk size. The unit depends on the size_unit option:
  • :characters - size in characters (default)
  • :tokens - size in tokens

chunk_overlap (integer, default: 200)
Number of characters or tokens to overlap between consecutive chunks. Helps maintain context across chunk boundaries.

size_unit (atom, default: :characters)
Unit for chunk_size and chunk_overlap: :characters or :tokens.

collection (string | map, default: "default")
Collection to organize the document. Can be:
  • A string: "my_collection"
  • A map: %{name: "my_collection", description: "Documentation for X"}
Collections help organize and filter documents in your knowledge base.

graph (boolean, optional)
Enable or disable GraphRAG entity/relationship extraction for this document. Overrides the global configuration setting.

format (atom, optional)
Format hint for the chunker (e.g., :markdown, :code). Some chunkers may use this to improve chunking quality.
On success: {:ok, %Arcana.Document{}}
Returns the created document struct containing:
  • id (binary_id) - Unique identifier for the document
  • content (string) - The original text content
  • source_id (string | nil) - Source identifier if provided
  • metadata (map) - Document metadata
  • status (atom) - Processing status: :completed, :failed, etc.
  • chunk_count (integer) - Number of chunks created from the document
  • collection_id (binary_id) - ID of the associated collection
  • content_type (string) - MIME type (defaults to “text/plain”)
  • inserted_at (DateTime) - Creation timestamp
  • updated_at (DateTime) - Last update timestamp
On failure: {:error, term()}
Returns an error tuple if ingestion fails. Common errors:
  • {:error, {:embedding_failed, reason}} - Failed to generate embeddings
  • {:error, :invalid_text} - Text is nil or empty
Examples:
# Basic text ingestion
{:ok, document} = Arcana.Ingest.ingest(
  "Elixir is a dynamic, functional language for building scalable applications.",
  repo: MyApp.Repo
)

IO.inspect(document.chunk_count)
# 1

# Long article with metadata
{:ok, document} = Arcana.Ingest.ingest(
  article_text,
  repo: MyApp.Repo,
  source_id: "blog_post_456",
  metadata: %{
    title: "Introduction to Elixir",
    author: "Jane Doe",
    published_at: ~D[2024-01-15],
    tags: ["elixir", "programming", "functional"]
  },
  chunk_size: 512,
  chunk_overlap: 100
)

# Organize in a collection with description
{:ok, document} = Arcana.Ingest.ingest(
  technical_docs,
  repo: MyApp.Repo,
  collection: %{
    name: "api_documentation",
    description: "REST API reference documentation"
  },
  chunk_size: 2048,
  chunk_overlap: 400
)

# Token-based chunking
{:ok, document} = Arcana.Ingest.ingest(
  code_documentation,
  repo: MyApp.Repo,
  chunk_size: 500,
  size_unit: :tokens,
  chunk_overlap: 50,
  format: :code
)

# Disable GraphRAG for specific document
{:ok, document} = Arcana.Ingest.ingest(
  simple_content,
  repo: MyApp.Repo,
  graph: false
)

ingest_file/2

Ingests a file by parsing its content and creating a document with embedded chunks.
ingest_file(path, opts) :: {:ok, %Arcana.Document{}} | {:error, term()}
path (string, required)
Absolute or relative path to the file to ingest. Supported formats:
  • Plain text (.txt)
  • Markdown (.md, .markdown)
  • PDF (.pdf)
  • Other formats via custom parsers

opts (keyword, required)
Same options as ingest/2:

repo (module, required)
The Ecto repo to use for database operations.

source_id (string, optional)
Source identifier.

metadata (map, optional)
Metadata map.

chunk_size (integer, default: 1024)
Maximum chunk size.

chunk_overlap (integer, default: 200)
Overlap between chunks.

collection (string | map, default: "default")
Collection name or map.

graph (boolean, optional)
Enable or disable GraphRAG extraction.
On success: {:ok, %Arcana.Document{}}
Returns the created document with additional fields:
  • file_path (string) - Path to the ingested file
  • content_type (string) - Detected MIME type (e.g., “text/markdown”, “application/pdf”)
On failure: {:error, term()}
Returns an error tuple if file processing fails:
  • {:error, :file_not_found} - File doesn’t exist
  • {:error, :parse_failed} - Failed to parse file
  • {:error, {:embedding_failed, reason}} - Embedding generation failed
Examples:
# Ingest a markdown file
{:ok, document} = Arcana.Ingest.ingest_file(
  "./docs/README.md",
  repo: MyApp.Repo,
  collection: "documentation"
)

IO.inspect(document.content_type)
# "text/markdown"

IO.inspect(document.file_path)
# "./docs/README.md"

# Ingest a PDF with metadata
{:ok, document} = Arcana.Ingest.ingest_file(
  "/path/to/annual_report.pdf",
  repo: MyApp.Repo,
  source_id: "report_2024",
  metadata: %{
    year: 2024,
    department: "finance",
    confidential: false
  },
  chunk_size: 2048,
  chunk_overlap: 400
)

# Ingest multiple files in a loop
for file <- Path.wildcard("docs/**/*.md") do
  {:ok, _doc} = Arcana.Ingest.ingest_file(
    file,
    repo: MyApp.Repo,
    source_id: "docs_import",
    metadata: %{file: Path.basename(file)},
    collection: "knowledge_base"
  )
end

# Handle errors
case Arcana.Ingest.ingest_file(path, repo: MyApp.Repo) do
  {:ok, document} ->
    Logger.info("Ingested #{document.chunk_count} chunks")
    
  {:error, :file_not_found} ->
    Logger.error("File not found: #{path}")
    
  {:error, reason} ->
    Logger.error("Ingestion failed: #{inspect(reason)}")
end

Chunking Strategies

Arcana supports different chunking strategies via the configured chunker:

Character-based Chunking

Split text by character count with overlap:
Arcana.Ingest.ingest(
  long_text,
  repo: MyApp.Repo,
  chunk_size: 1024,        # 1024 characters per chunk
  chunk_overlap: 200,      # 200 character overlap
  size_unit: :characters   # explicit (default)
)
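The overlap arithmetic is a sliding window: each chunk starts chunk_size - chunk_overlap characters after the previous one. The sketch below is an illustration of that mechanism only, not Arcana's actual chunker (which may also respect word and sentence boundaries):

```elixir
# Naive character chunking with overlap. Each window begins
# (chunk_size - chunk_overlap) characters after the previous one;
# the final chunk may be shorter than chunk_size.
defmodule ChunkSketch do
  def chunk(text, chunk_size, chunk_overlap) when chunk_size > chunk_overlap do
    step = chunk_size - chunk_overlap

    text
    |> String.graphemes()
    |> Enum.chunk_every(chunk_size, step, [])
    |> Enum.map(&Enum.join/1)
  end
end

ChunkSketch.chunk("abcdefghij", 4, 2)
# => ["abcd", "cdef", "efgh", "ghij", "ij"]
```

Note the trailing characters of each chunk reappear at the start of the next, which is what preserves context across boundaries.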

Token-based Chunking

Split text by token count for better LLM compatibility:
Arcana.Ingest.ingest(
  documentation,
  repo: MyApp.Repo,
  chunk_size: 500,      # 500 tokens per chunk
  chunk_overlap: 50,    # 50 token overlap
  size_unit: :tokens
)
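Token-based chunking uses the same sliding-window arithmetic, just counted in tokens instead of characters. As a rough self-contained analogue, here is a word-level version (real tokenizers split on subwords, not whitespace, so actual token counts will differ):

```elixir
# Word-level sliding window - only an approximation of token-based
# chunking, since real tokenizers operate on subword units.
defmodule TokenChunkSketch do
  def chunk(text, chunk_size, chunk_overlap) when chunk_size > chunk_overlap do
    step = chunk_size - chunk_overlap

    text
    |> String.split()
    |> Enum.chunk_every(chunk_size, step, [])
    |> Enum.map(&Enum.join(&1, " "))
  end
end

TokenChunkSketch.chunk("one two three four five six", 3, 1)
# => ["one two three", "three four five", "five six"]
```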

Format-aware Chunking

Provide format hints for better chunking:
# Markdown-aware chunking (respects headings, lists, code blocks)
Arcana.Ingest.ingest(
  markdown_content,
  repo: MyApp.Repo,
  format: :markdown
)

# Code-aware chunking (respects function boundaries, etc.)
Arcana.Ingest.ingest(
  source_code,
  repo: MyApp.Repo,
  format: :code
)
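To see why a format hint helps, consider what a markdown-aware splitter does differently from a fixed-size window: it breaks at structural boundaries so a section is never cut mid-heading. A minimal sketch of that idea (splitting on top-level headings; Arcana's real markdown chunker is more sophisticated and also enforces size limits):

```elixir
# Naive format-aware splitting: break a markdown document before each
# "# " heading so every section stays intact. Illustration only.
defmodule MarkdownSplitSketch do
  def split(markdown) do
    String.split(markdown, ~r/^(?=#\s)/m, trim: true)
  end
end

MarkdownSplitSketch.split("# A\ntext\n# B\nmore")
# => ["# A\ntext\n", "# B\nmore"]
```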

Collections

Collections organize documents in your knowledge base:
# Simple collection name
Arcana.Ingest.ingest(
  text,
  repo: MyApp.Repo,
  collection: "user_manuals"
)

# Collection with description
Arcana.Ingest.ingest(
  text,
  repo: MyApp.Repo,
  collection: %{
    name: "engineering_docs",
    description: "Internal engineering documentation and runbooks"
  }
)

# Later, filter searches by collection
Arcana.search(
  "how to deploy",
  repo: MyApp.Repo,
  collection: "engineering_docs"
)

GraphRAG Integration

When GraphRAG is enabled, Arcana extracts entities and relationships during ingestion:
# Enable in config.exs
config :arcana,
  graph_enabled: true,
  entity_extractor: {Arcana.Graph.EntityExtractors.LLM, llm: "openai:gpt-4o-mini"}

# Ingest with automatic entity extraction
{:ok, document} = Arcana.Ingest.ingest(
  """
  John Smith is the CEO of Acme Corp, based in San Francisco.
  The company specializes in cloud infrastructure.
  """,
  repo: MyApp.Repo,
  graph: true  # explicit enable
)

# Later searches will be enhanced with graph relationships
Arcana.search(
  "Tell me about Acme Corp",
  repo: MyApp.Repo
)
# Returns chunks + related entities via graph search

Telemetry Events

The ingest module emits telemetry events for monitoring:
:telemetry.attach(
  "my-handler",
  [:arcana, :ingest, :stop],
  fn _event, measurements, metadata, _config ->
    IO.inspect(metadata.chunk_count)
    IO.inspect(measurements.duration)
  end,
  nil
)
Events:
  • [:arcana, :ingest, :start] - Ingestion started
  • [:arcana, :ingest, :stop] - Ingestion completed
  • [:arcana, :ingest, :exception] - Ingestion failed

Error Handling

case Arcana.Ingest.ingest(text, repo: MyApp.Repo) do
  {:ok, document} ->
    # Success - document.status will be :completed
    Logger.info("Created document with #{document.chunk_count} chunks")
    
  {:error, {:embedding_failed, reason}} ->
    # Embedding service error (check API key, rate limits, etc.)
    Logger.error("Embedding failed: #{inspect(reason)}")
    
  {:error, reason} ->
    # Other errors
    Logger.error("Ingestion failed: #{inspect(reason)}")
end
