GET /api/get-document-content
Get Document Content
curl --request GET \
  --url 'https://api.example.com/api/get-document-content?id=123'
{
  "success": true,
  "chunks": [
    {
      "chunk_text": "<string>",
      "chunk_index": 123
    }
  ],
  "content": "<string>",
  "chunk_count": 123,
  "error": "<string>"
}

Overview

This endpoint retrieves the chunked text content of a document as stored in the RAG vector database. It returns:
  • Individual text chunks with their index positions
  • Combined content with chunk separators
  • Total chunk count
This is useful for previewing document content, debugging RAG indexing, and understanding how documents are split for semantic search.

Request

  • id (integer, required): The unique identifier of the document, passed as a query parameter (for example ?id=42)
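
As a minimal sketch, the request URL can be assembled with the standard URL API so that the id is validated and encoded before the call is made. The helper name buildDocumentContentUrl and its baseUrl argument are illustrative assumptions, not part of the API.

```javascript
// Hypothetical helper (not part of the API): builds the GET URL for a document ID.
function buildDocumentContentUrl(baseUrl, id) {
  // The endpoint documents id as a required integer.
  if (!Number.isInteger(id)) {
    throw new TypeError(`id must be an integer, got ${String(id)}`);
  }
  const url = new URL('/api/get-document-content', baseUrl);
  url.searchParams.set('id', String(id));
  return url.toString();
}

const requestUrl = buildDocumentContentUrl('https://your-domain.com', 42);
console.log(requestUrl);
```

Rejecting non-integer values on the client avoids a round trip that would only produce the generic error response.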

Response

  • success (boolean, required): Indicates whether the request succeeded
  • chunks (array, required): Array of chunk objects ordered by chunk index
  • chunks[].chunk_text (string): The text content of this chunk
  • chunks[].chunk_index (integer): The position of this chunk in the original document (0-based)
  • content (string, required): All chunks joined together, with the separator \n\n---\n\n between consecutive chunks
  • chunk_count (integer, required): Total number of chunks in the document
  • error (string): Error message if the request failed
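
The field list above can be turned into a small runtime check. The following is a sketch only: validateDocumentContentResponse is a hypothetical name, and the rule that chunk_count equals the number of returned chunk objects is an assumption inferred from the examples rather than a documented guarantee.

```javascript
// Hypothetical validator: returns a list of problems found in a parsed response body.
function validateDocumentContentResponse(body) {
  if (typeof body !== 'object' || body === null) return ['body is not an object'];
  const problems = [];
  if (typeof body.success !== 'boolean') problems.push('success must be a boolean');

  if (body.success === false) {
    // Failed responses carry an error message instead of chunk data.
    if (typeof body.error !== 'string') problems.push('error must be a string');
    return problems;
  }

  if (!Array.isArray(body.chunks)) {
    problems.push('chunks must be an array');
  } else {
    body.chunks.forEach((chunk, i) => {
      if (typeof chunk?.chunk_text !== 'string') problems.push(`chunks[${i}].chunk_text must be a string`);
      if (!Number.isInteger(chunk?.chunk_index)) problems.push(`chunks[${i}].chunk_index must be an integer`);
    });
    // Assumption: chunk_count equals the number of chunk objects returned.
    if (body.chunk_count !== body.chunks.length) problems.push('chunk_count does not match chunks.length');
  }
  if (typeof body.content !== 'string') problems.push('content must be a string');
  if (!Number.isInteger(body.chunk_count)) problems.push('chunk_count must be an integer');
  return problems;
}
```

An empty array means the body matches the documented shape; any other result lists what to inspect.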

Example

curl -X GET "https://your-domain.com/api/get-document-content?id=42"

Success Response

{
  "success": true,
  "chunks": [
    {
      "chunk_text": "Welcome to our company handbook. This document outlines our policies, procedures, and company culture. Our mission is to provide exceptional service while maintaining a positive work environment.",
      "chunk_index": 0
    },
    {
      "chunk_text": "Employee Benefits: We offer comprehensive health insurance, 401(k) matching, unlimited PTO, and professional development opportunities. All employees are eligible for benefits after 30 days of employment.",
      "chunk_index": 1
    },
    {
      "chunk_text": "Work Schedule: Our standard work week is Monday through Friday, 9 AM to 5 PM. Remote work options are available for eligible positions. Please discuss flexible arrangements with your manager.",
      "chunk_index": 2
    }
  ],
  "content": "Welcome to our company handbook. This document outlines our policies, procedures, and company culture. Our mission is to provide exceptional service while maintaining a positive work environment.\n\n---\n\nEmployee Benefits: We offer comprehensive health insurance, 401(k) matching, unlimited PTO, and professional development opportunities. All employees are eligible for benefits after 30 days of employment.\n\n---\n\nWork Schedule: Our standard work week is Monday through Friday, 9 AM to 5 PM. Remote work options are available for eligible positions. Please discuss flexible arrangements with your manager.",
  "chunk_count": 3
}

Error Responses

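Missing ID Parameter

If the id parameter is missing, the endpoint returns a generic error. The message is in Spanish and translates to "Error retrieving document content":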
{
  "success": false,
  "error": "Error al obtener contenido del documento"
}

Document Not Found

If no chunks are stored for the given document ID (for example, because the document doesn't exist), the endpoint still returns success: true with an empty chunks array:
{
  "success": true,
  "chunks": [],
  "content": "",
  "chunk_count": 0
}
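
Because a missing document is reported with success: true, client code has to check chunk_count as well as success. A minimal sketch of that distinction follows; classifyDocumentContent and its status labels are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: maps a parsed response body to one of three outcomes.
function classifyDocumentContent(body) {
  if (!body.success) {
    return { status: 'error', message: body.error ?? 'Unknown error' };
  }
  if (body.chunk_count === 0) {
    // success: true with no chunks means nothing is stored under this ID.
    return { status: 'not_found' };
  }
  return { status: 'ok', chunks: body.chunks };
}
```

For example, a UI could show a "document not indexed" notice for not_found and reserve the error state for failed requests.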

Implementation Details

Query Logic

The endpoint queries the vectors table directly (api/get-document-content.php:15-18):
$chunks = $db->fetchAll(
    'SELECT chunk_text, chunk_index FROM vectors WHERE document_id = :id ORDER BY chunk_index ASC',
    [':id' => $id]
);
This retrieves all chunks for the document ordered by their position in the original text.

Content Joining

Chunks are joined with a visual separator (api/get-document-content.php:20):
$content = implode("\n\n---\n\n", array_column($chunks, 'chunk_text'));
The separator \n\n---\n\n makes it easy to visually distinguish between chunks when displaying the full content.
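
The same join can be mirrored on the client, and content can be split back into chunk texts with the same separator. This is a sketch under assumptions: joinChunks and splitContent are hypothetical helpers, and the split is only reliable when no chunk text contains the separator itself, so the chunks array remains the safer source.

```javascript
const CHUNK_SEPARATOR = '\n\n---\n\n';

// Mirrors the server-side implode(): joins chunk texts with the separator.
function joinChunks(chunks) {
  return chunks.map(c => c.chunk_text).join(CHUNK_SEPARATOR);
}

// Splits combined content back into chunk texts.
// Caveat: ambiguous if a chunk's own text contains the separator.
// An empty content string is treated as zero chunks.
function splitContent(content) {
  return content === '' ? [] : content.split(CHUNK_SEPARATOR);
}
```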

Chunk Ordering

Chunks are always returned in order by chunk_index ASC, ensuring the content appears in the same sequence as the original document.
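
When debugging indexing, it can help to confirm that the returned indexes form the sequence 0, 1, 2, … with no gaps or duplicates. The sketch below assumes indexes are contiguous, which the endpoint does not explicitly guarantee (it only documents ordering and 0-based numbering); findIndexAnomalies is a hypothetical helper.

```javascript
// Hypothetical check: reports every position whose chunk_index differs from
// its position in the (already ordered) chunks array.
function findIndexAnomalies(chunks) {
  const anomalies = [];
  chunks.forEach((chunk, position) => {
    if (chunk.chunk_index !== position) {
      anomalies.push({ position, expected: position, actual: chunk.chunk_index });
    }
  });
  return anomalies;
}
```

An empty result means the indexes are contiguous; a non-empty one points at the first position where a chunk may be missing or duplicated.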

Use Cases

Document Preview

Display a preview of document content before processing:
const { content, chunk_count } = await fetch(
  `/api/get-document-content?id=${docId}`
).then(r => r.json());

// Show the first 500 characters, adding an ellipsis only when the text is cut off
const preview = content.length > 500 ? content.substring(0, 500) + '...' : content;
console.log(`Preview (${chunk_count} chunks total):\n${preview}`);

Chunk Analysis

Analyze chunk sizes and distribution:
const { chunks } = await fetch(
  `/api/get-document-content?id=${docId}`
).then(r => r.json());

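const chunkLengths = chunks.map(c => c.chunk_text.length);

if (chunkLengths.length === 0) {
  // Avoid NaN and ±Infinity when the document has no chunks
  console.log('No chunks found for this document');
} else {
  // Lengths are measured in characters, not tokens
  const avgLength = chunkLengths.reduce((a, b) => a + b, 0) / chunkLengths.length;
  const maxLength = Math.max(...chunkLengths);
  const minLength = Math.min(...chunkLengths);
  console.log(`Avg: ${avgLength.toFixed(1)}, Min: ${minLength}, Max: ${maxLength}`);
}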

RAG Debugging

Inspect how a document was chunked for troubleshooting:
const { chunks } = await fetch(
  `/api/get-document-content?id=${docId}`
).then(r => r.json());

// Find chunks containing a specific keyword (case-insensitive)
const keyword = 'pricing';
const relevantChunks = chunks.filter(c =>
  c.chunk_text.toLowerCase().includes(keyword.toLowerCase())
);

console.log(`Found "${keyword}" in ${relevantChunks.length} chunks:`);
relevantChunks.forEach(c => {
  console.log(`  Chunk ${c.chunk_index}: ${c.chunk_text.substring(0, 100)}...`);
});

Export to Text File

Export document content as plain text:
const { content } = await fetch(
  `/api/get-document-content?id=${docId}`
).then(r => r.json());

const blob = new Blob([content], { type: 'text/plain' });
const url = URL.createObjectURL(blob);

const a = document.createElement('a');
a.href = url;
a.download = 'document-content.txt';
a.click();

// Release the object URL once the download has been triggered
URL.revokeObjectURL(url);

Search Within Document

Search for text within a specific document:
async function searchInDocument(docId, searchTerm) {
  const { chunks } = await fetch(
    `/api/get-document-content?id=${docId}`
  ).then(r => r.json());
  
  const results = chunks
    .map(chunk => ({
      index: chunk.chunk_index,
      text: chunk.chunk_text,
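      matches: (chunk.chunk_text.match(
        // Escape regex metacharacters so the search term is matched literally
        new RegExp(searchTerm.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), 'gi')
      ) || []).length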
    }))
    .filter(r => r.matches > 0)
    .sort((a, b) => b.matches - a.matches);
  
  return results;
}

const results = await searchInDocument(42, 'customer service');
console.log(`Found ${results.length} chunks with matches`);

Chunk Structure

Each chunk in the response contains:
  • chunk_text: The actual text content extracted from the document
  • chunk_index: Zero-based position in the document (0, 1, 2, …)
Chunks are created during the upload process with configurable size and overlap:
  • Chunk Size: Typically 500-1000 tokens (configured via rag.chunk_size)
  • Chunk Overlap: Typically 50-200 tokens (configured via rag.chunk_overlap)
Overlap ensures semantic continuity between chunks for better RAG retrieval.
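
To illustrate how size and overlap interact, here is a sliding-window sketch. It is not the actual upload-time chunker, which is not shown here: chunkWithOverlap is a hypothetical function, and whitespace-separated words stand in for tokens.

```javascript
// Sliding-window chunking sketch. Each window holds chunkSize tokens and
// repeats the last chunkOverlap tokens of the previous window.
function chunkWithOverlap(tokens, chunkSize, chunkOverlap) {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  if (!Number.isInteger(chunkOverlap) || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('chunkOverlap must be an integer in [0, chunkSize)');
  }
  const step = chunkSize - chunkOverlap; // how far each window advances
  const chunks = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the input.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}

// Words stand in for tokens (assumption for illustration only).
const words = 'a b c d e f g h i j'.split(' ');
const windows = chunkWithOverlap(words, 4, 1);
// windows: [a b c d], [d e f g], [g h i j]
```

With a size of 4 and an overlap of 1, each window starts 3 tokens after the previous one, so the last token of one chunk reappears as the first token of the next.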
