Readers ingest data from various sources and convert it into Document objects that LlamaIndex can process. LlamaIndex.TS provides built-in readers for common file formats and integrations.

Overview

Readers implement the BaseReader interface:
interface BaseReader {
  loadData(...args: unknown[]): Promise<Document[]>;
}
All readers convert their input format into one or more Document objects with text content and metadata.
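To make the contract concrete, here is a minimal self-contained sketch. The `Doc` type and `StringReader` class below are simplified illustrative stand-ins, not part of the llamaindex API; a real reader would return llamaindex `Document` instances.

```typescript
// Simplified stand-in for llamaindex's Document, for illustration only.
type Doc = {
  text: string;
  metadata: Record<string, unknown>;
};

// The reader contract: any input in, Promise of documents out.
interface Reader {
  loadData(...args: unknown[]): Promise<Doc[]>;
}

// A trivial reader that wraps an in-memory string in a single document.
class StringReader implements Reader {
  async loadData(content: string): Promise<Doc[]> {
    return [{ text: content, metadata: { source: "in-memory" } }];
  }
}
```

Every built-in reader follows this shape: the input varies (a path, a URL, an API query), but the output is always `Document[]`.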

File Readers

SimpleDirectoryReader

Load multiple file types from a directory:
import { SimpleDirectoryReader } from "@llamaindex/readers/directory";

const reader = new SimpleDirectoryReader();
const documents = await reader.loadData("./data");

console.log(`Loaded ${documents.length} documents`);
Supported formats: TXT, PDF, CSV, Markdown, DOCX, HTML, JPG/PNG/GIF, XML

Custom File Extensions

import { SimpleDirectoryReader } from "@llamaindex/readers/directory";
import { JSONReader } from "@llamaindex/readers/json";

const reader = new SimpleDirectoryReader();
const documents = await reader.loadData({
  directoryPath: "./data",
  fileExtToReader: {
    json: new JSONReader(),
    // Add custom readers for other extensions
  }
});

PDF Reader

import { PDFReader } from "@llamaindex/readers/pdf";

const reader = new PDFReader();
const documents = await reader.loadData("document.pdf");

// Each page becomes a separate document
for (const doc of documents) {
  console.log(doc.metadata.page_number);
}

DOCX Reader

import { DocxReader } from "@llamaindex/readers/docx";

const reader = new DocxReader();
const documents = await reader.loadData("document.docx");

CSV Reader

import { CSVReader } from "@llamaindex/readers/csv";

// Concatenate all rows into one document
const reader = new CSVReader(
  true,      // concatRows
  ", ",      // colJoiner
  "\n"       // rowJoiner
);

const documents = await reader.loadData("data.csv");

// Or create one document per row
const rowReader = new CSVReader(false);
const rowDocuments = await rowReader.loadData("data.csv");

Markdown Reader

import { MarkdownReader } from "@llamaindex/readers/markdown";

const reader = new MarkdownReader(
  true,  // removeHyperlinks
  true   // removeImages
);

const documents = await reader.loadData("README.md");

// Documents are split by headers
for (const doc of documents) {
  console.log(doc.text);
}

HTML Reader

import { HTMLReader } from "@llamaindex/readers/html";

const reader = new HTMLReader();
const documents = await reader.loadData("page.html");

JSON Reader

import { JSONReader } from "@llamaindex/readers/json";

const reader = new JSONReader();
const documents = await reader.loadData("data.json");

Image Reader

import { ImageReader } from "@llamaindex/readers/image";

const reader = new ImageReader();
const imageDocuments = await reader.loadData("photo.jpg");

// Creates ImageDocument with image blob

Text File Reader

import { TextFileReader } from "@llamaindex/readers/text";

const reader = new TextFileReader();
const documents = await reader.loadData("file.txt");

XML Reader

import { XMLReader } from "@llamaindex/readers/xml";

const reader = new XMLReader();
const documents = await reader.loadData("data.xml");

LlamaParse

LlamaParse is a premium document parsing service that handles complex layouts, tables, and figures:
import { LlamaParseReader } from "llamaindex";

const reader = new LlamaParseReader({
  apiKey: process.env.LLAMA_CLOUD_API_KEY,
  resultType: "markdown",  // or "text"
  language: "en"
});

const documents = await reader.loadData("complex-document.pdf");

Features

  • Advanced PDF parsing: Tables, charts, multi-column layouts
  • Image extraction: Embedded images and figures
  • Format preservation: Maintains document structure
  • Multiple formats: PDF, DOCX, PPTX, and more

Configuration

const reader = new LlamaParseReader({
  apiKey: process.env.LLAMA_CLOUD_API_KEY,
  resultType: "markdown",
  numWorkers: 4,
  verbose: true,
  language: "en",
  // Advanced options
  parsingInstructions: "Focus on extracting tables",
  skipDiagonalText: false,
  invalidateCache: false,
  doNotCache: false,
  fastMode: false
});

Platform Integrations

Notion Reader

import { NotionReader } from "@llamaindex/notion";

const reader = new NotionReader({
  auth: process.env.NOTION_TOKEN
});

const documents = await reader.loadData({
  databaseId: "your-database-id"
});

Discord Reader

import { DiscordReader } from "@llamaindex/discord";

const reader = new DiscordReader({
  token: process.env.DISCORD_TOKEN
});

const documents = await reader.loadData({
  channelId: "channel-id",
  limit: 100
});

AssemblyAI Reader

Transcribe audio/video files:
import { AssemblyAIReader } from "@llamaindex/assemblyai";

const reader = new AssemblyAIReader({
  apiKey: process.env.ASSEMBLYAI_API_KEY
});

const documents = await reader.loadData("podcast.mp3");

Loading from URLs

Many readers support loading from HTTP/HTTPS URLs:
import { PDFReader } from "@llamaindex/readers/pdf";

const reader = new PDFReader();
const documents = await reader.loadData(
  "https://example.com/document.pdf"
);

Custom Readers

Create your own reader by implementing BaseReader:
import { BaseReader, Document } from "llamaindex";

class CustomAPIReader implements BaseReader {
  constructor(private apiKey: string) {}
  
  async loadData(endpoint: string): Promise<Document[]> {
    // Fetch data from your API
    const response = await fetch(endpoint, {
      headers: {
        Authorization: `Bearer ${this.apiKey}`
      }
    });
    
    const data = await response.json();
    
    // Convert to Documents
    return data.items.map((item: any) => 
      new Document({
        text: item.content,
        metadata: {
          id: item.id,
          title: item.title,
          date: item.created_at
        }
      })
    );
  }
}

const reader = new CustomAPIReader(process.env.API_KEY!);
const documents = await reader.loadData("https://api.example.com/items");

Extending FileReader

For file-based readers, extend FileReader:
import { FileReader, Document } from "@llamaindex/core/schema";

class CustomFileReader extends FileReader {
  async loadDataAsContent(
    fileContent: Uint8Array,
    filename?: string
  ): Promise<Document[]> {
    // Parse file content
    const text = new TextDecoder().decode(fileContent);
    
    // Custom parsing logic
    const sections = this.parseCustomFormat(text);
    
    // Return documents
    return sections.map(section => 
      new Document({
        text: section.content,
        metadata: {
          filename,
          section: section.name
        }
      })
    );
  }
  
  private parseCustomFormat(text: string) {
    // Your parsing logic
    return [];
  }
}

const reader = new CustomFileReader();
const documents = await reader.loadData("file.custom");

Complete Example

import { 
  VectorStoreIndex,
  IngestionPipeline,
  SentenceSplitter
} from "llamaindex";
import { SimpleDirectoryReader } from "@llamaindex/readers/directory";
import { OpenAIEmbedding } from "@llamaindex/openai";
import { PDFReader } from "@llamaindex/readers/pdf";
import { MarkdownReader } from "@llamaindex/readers/markdown";

async function main() {
  // Load documents from directory
  const reader = new SimpleDirectoryReader();
  const documents = await reader.loadData({
    directoryPath: "./data",
    fileExtToReader: {
      pdf: new PDFReader(),
      md: new MarkdownReader()
    }
  });
  
  console.log(`Loaded ${documents.length} documents`);
  
  // Inspect documents
  for (const doc of documents.slice(0, 3)) {
    console.log("File:", doc.metadata.file_name);
    console.log("Preview:", doc.text.substring(0, 100));
  }
  
  // Process with pipeline
  const pipeline = new IngestionPipeline({
    transformations: [
      new SentenceSplitter({ chunkSize: 1024 }),
      new OpenAIEmbedding()
    ]
  });
  
  const nodes = await pipeline.run({ documents });
  
  // Create index
  const index = await VectorStoreIndex.init({ nodes });
  
  // Query
  const queryEngine = index.asQueryEngine();
  const response = await queryEngine.query({
    query: "What are the main topics across all documents?"
  });
  
  console.log(response.toString());
}

main().catch(console.error);

Available Reader Packages

Core Readers

@llamaindex/readers
  • SimpleDirectoryReader
  • PDFReader
  • CSVReader
  • MarkdownReader
  • DocxReader
  • HTMLReader
  • JSONReader
  • ImageReader
  • TextFileReader
  • XMLReader

Platform Integrations

  • @llamaindex/notion - Notion databases
  • @llamaindex/discord - Discord channels
  • @llamaindex/assemblyai - Audio/video transcription

Premium Services

  • LlamaParse - Advanced document parsing
  • LlamaCloud - Managed data ingestion

Community

Check the LlamaIndex Hub for community-contributed readers:
  • Web scrapers
  • Database connectors
  • API integrations
  • And more

Best Practices

  1. Choose the right reader
    • Use format-specific readers for better parsing
    • LlamaParse for complex PDFs with tables
    • SimpleDirectoryReader for mixed formats
  2. Handle metadata
    • Readers automatically add file paths and names
    • Preserve source information for citations
    • Add custom metadata after loading
  3. Process in batches
    • Load files in chunks for large datasets
    • Monitor memory usage
    • Use streaming when possible
  4. Error handling
    • Catch and log file-specific errors
    • Continue processing other files on failure
    • Validate file formats before reading
  5. Combine with pipelines
    • Use readers with IngestionPipeline
    • Chain transformations after reading
    • Cache results for repeated access
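The batching and error-handling practices above can be sketched with a small library-agnostic helper. The `load` callback stands in for any reader's `loadData(path)` method; `loadAllSafely` is an illustrative name, not part of the llamaindex API.

```typescript
// Load many files through a reader-style callback, in batches,
// collecting per-file failures instead of aborting the whole run.
async function loadAllSafely<T>(
  paths: string[],
  load: (path: string) => Promise<T[]>,
  batchSize = 8,
): Promise<{ loaded: T[]; failed: { path: string; error: unknown }[] }> {
  const loaded: T[] = [];
  const failed: { path: string; error: unknown }[] = [];

  for (let i = 0; i < paths.length; i += batchSize) {
    const batch = paths.slice(i, i + batchSize);
    // Promise.allSettled never rejects, so one bad file can't sink the batch.
    const results = await Promise.allSettled(batch.map((p) => load(p)));
    results.forEach((result, j) => {
      if (result.status === "fulfilled") {
        loaded.push(...result.value);
      } else {
        failed.push({ path: batch[j], error: result.reason });
      }
    });
  }
  return { loaded, failed };
}
```

With a real reader you might call `loadAllSafely(paths, (p) => new PDFReader().loadData(p))`, then log the `failed` entries and feed `loaded` into an `IngestionPipeline`.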

Next Steps

  • Documents - Work with Document objects
  • Ingestion - Build data processing pipelines
  • Node Parsers - Split documents into chunks
  • LlamaParse - Advanced document parsing
