Spacebot’s ingestion system processes files from a watched directory, extracts text, chunks it, and imports each chunk as memories via the standard memory recall + save flow.
Text is split at line boundaries to preserve semantic units:
src/agent/ingestion.rs
```rust
fn chunk_text(text: &str, chunk_size: usize) -> Vec<String> {
    // ...
    // If adding this line exceeds chunk_size and we have content, finalize chunk
    // ...
    // Long lines exceeding chunk_size get their own chunk
    // ...
}
```
Example:
```
Input (10,000 chars, chunk_size = 4000):
  Line 1 (500 chars)
  Line 2 (1500 chars)   <- chunk 1 ends here (adding Line 3 would total 4500, exceeding the limit)
  Line 3 (2500 chars)   <- chunk 2
  Line 4 (3000 chars)   <- chunk 3
  Line 5 (2500 chars)   <- chunk 4

Chunks:
  1. Lines 1-2 (2000 chars)
  2. Line 3 (2500 chars)
  3. Line 4 (3000 chars)
  4. Line 5 (2500 chars)
```
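The walkthrough above can be checked mechanically. Below is a short, self-contained sketch of the line-boundary rule (an illustration only, not the production `chunk_text`), run over synthetic lines of the same lengths:

```rust
fn chunk_by_lines(text: &str, chunk_size: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for line in text.lines() {
        // If adding this line would exceed chunk_size and we have content,
        // finalize the current chunk first
        if !current.is_empty() && current.len() + 1 + line.len() > chunk_size {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push('\n');
        }
        current.push_str(line);
        // A single line longer than chunk_size becomes its own chunk
        if current.len() > chunk_size {
            chunks.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    // Synthetic lines matching the example: 500, 1500, 2500, 3000, 2500 chars
    let text: String = [500usize, 1500, 2500, 3000, 2500]
        .iter()
        .map(|n| "x".repeat(*n))
        .collect::<Vec<_>>()
        .join("\n");
    let sizes: Vec<usize> = chunk_by_lines(&text, 4000).iter().map(|c| c.len()).collect();
    // [2001, 2500, 3000, 2500] -- chunk 1 is lines 1-2 plus the joining newline
    println!("{:?}", sizes);
}
```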
```sql
CREATE TABLE ingestion_progress (
  content_hash TEXT NOT NULL,
  chunk_index INTEGER NOT NULL,
  total_chunks INTEGER NOT NULL,
  filename TEXT NOT NULL,
  PRIMARY KEY (content_hash, chunk_index)
);

CREATE TABLE ingestion_files (
  content_hash TEXT PRIMARY KEY,
  filename TEXT NOT NULL,
  file_size INTEGER NOT NULL,
  total_chunks INTEGER NOT NULL,
  status TEXT NOT NULL, -- queued, processing, completed, failed
  created_at TEXT NOT NULL,
  completed_at TEXT
);
```
If ingestion is interrupted:
1. Restart: the server restarts and the ingestion loop resumes.
2. Scan: the file is still in the ingest/ directory.
3. Resume: completed chunks are loaded from ingestion_progress.
4. Skip Completed: only chunks not in the completed set are processed.
5. Finish: when all chunks succeed, the file and its progress records are deleted.
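Steps 3-4 reduce to a set difference over chunk indices. A minimal sketch (names here are illustrative; in practice the completed set is loaded from ingestion_progress):

```rust
use std::collections::HashSet;

// Given the file's total chunk count and the chunk indices already recorded
// as completed, return only the indices that still need processing.
fn remaining_chunks(total_chunks: usize, completed: &HashSet<usize>) -> Vec<usize> {
    (0..total_chunks).filter(|i| !completed.contains(i)).collect()
}

fn main() {
    // Chunks 0, 1, and 3 of 5 were imported before the interruption
    let completed: HashSet<usize> = [0usize, 1, 3].iter().copied().collect();
    println!("{:?}", remaining_chunks(5, &completed)); // [2, 4]
}
```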
If any chunk errors (e.g., provider 401, rate limit):
- The file stays in ingest/
- Progress records persist
- Status is marked failed in ingestion_files
- The next poll retries failed chunks
src/agent/ingestion.rs
```rust
if had_failure {
    // Keep the source file and progress for retry
    return Ok(());
}

// Full success: clean up
delete_progress(&pool, &hash).await?;
tokio::fs::remove_file(path).await?;
```
This prevents data loss when transient errors interrupt ingestion.
```bash
curl -X POST http://localhost:3000/api/agents/spacebot/ingest \
  -F "file=@notes.txt"
```
The API writes the file to {workspace}/ingest/ and marks it as queued in ingestion_files. The polling loop picks it up on the next scan.

Query ingestion status:
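For example, against the ingestion_files schema above (a sketch; how you connect to the database depends on your setup):

```sql
SELECT filename, status, total_chunks, created_at, completed_at
FROM ingestion_files
ORDER BY created_at DESC;
```

Rows with status = 'failed' are the ones the next poll will retry.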
Files over 1MB should be split before ingestion. Chunking happens in-memory—extremely large files may hit memory limits.
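One way to pre-split on the command line, with placeholder filenames (`split -b` takes a byte count, where `k` means KiB):

```shell
# Generate a ~3 MB stand-in file, then split it into pieces safely under
# the 1MB guideline; pieces come out as big-export-part-aa, -ab, ...
head -c 3000000 /dev/zero | tr '\0' 'x' > big-export.txt
split -b 900k big-export.txt big-export-part-
ls big-export-part-*
```

Each piece can then be dropped into ingest/ as an ordinary file.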
Clean Up Noise
Remove headers, footers, or boilerplate before ingesting. The LLM processes everything, so cleaner input = better memories.
Monitor Failures
Check the ingestion_files table for failed status. Common causes: rate limits, invalid UTF-8, or corrupted PDFs.
Use Descriptive Filenames
The filename is included in the chunk prompt. 2026-02-28-standup.txt is better than notes.txt.
Ingestion is not instant. A 100KB file (~25 chunks) takes 2-5 minutes to process depending on LLM latency. For bulk imports, enable ingestion and let it run overnight.