## Overview

Replicate provides on-demand inference for open-source LLMs, including Llama, Mistral, and other models. The provider implements custom prompt formatting for different model families.
## Installation

```bash
npm install @llamaindex/replicate
```
## Basic Usage

```typescript
import { ReplicateLLM } from "@llamaindex/replicate";

const llm = new ReplicateLLM({
  model: "llama-3-70b-instruct",
  temperature: 0.7,
});

const response = await llm.chat({
  messages: [{ role: "user", content: "Explain neural networks" }],
});

console.log(response.message.content);
```
## Constructor Options

| Option | Type | Default | Description |
|---|---|---|---|
| `model` | `keyof typeof ALL_AVAILABLE_REPLICATE_MODELS` | `"llama-3-70b-instruct"` | Replicate model name from the available models |
| `temperature` | `number` | `0.1` | Sampling temperature (minimum 0.01 for Replicate) |
| `topP` | `number` | `0.9` | Nucleus sampling parameter (Llama 3 defaults to 0.9) |
| `maxTokens` | `number` | model's context window | Maximum tokens in the response |
| `chatStrategy` | `ReplicateChatStrategy` | auto-detected from model | Chat prompt formatting strategy |
| `replicateSession` | `ReplicateSession` | (none) | Custom Replicate session with an API token |
| `noWarn` | `boolean` | `false` | Suppress the default-model warning |
## Supported Models

### Llama 3

- `llama-3-70b-instruct`: 70B parameters, 8K context (default)
- `llama-3-8b-instruct`: 8B parameters, 8K context

### Llama 2

- `Llama-2-70b-chat-4bit`: 70B, 4-bit quantized, 4K context
- `Llama-2-70b-chat-old`: 70B, older version, 4K context
- `Llama-2-13b-chat-4bit`: 13B, 4-bit quantized, 4K context
- `Llama-2-13b-chat-old`: 13B, older version, 4K context
- `Llama-2-7b-chat-4bit`: 7B, 4-bit quantized, 4K context
- `Llama-2-7b-chat-old`: 7B, older version, 4K context
## Chat Strategies

Replicate uses different prompt formats for different model versions:

```typescript
import { ReplicateLLM, ReplicateChatStrategy } from "@llamaindex/replicate";

const llm = new ReplicateLLM({
  model: "llama-3-70b-instruct",
  chatStrategy: ReplicateChatStrategy.LLAMA3, // auto-detected for Llama 3 models
});
```

Available strategies:

- `LLAMA3`: Llama 3 format with special tokens
- `META`: Standard Llama 2 format
- `METAWBOS`: Llama 2 format with BOS/EOS tokens
- `A16Z`: A16Z-Infra format
- `REPLICATE4BIT`: 4-bit model format
- `REPLICATE4BITWNEWLINES`: 4-bit format with newlines
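The strategy names map onto model families. As a rough illustration of how auto-detection from the model name could work, here is a hypothetical sketch; `detectStrategy` and its logic are illustrative only, not the provider's actual implementation:

```typescript
// Hypothetical sketch of strategy auto-detection from the model name.
// The provider's real logic may differ; this only illustrates the idea.
type Strategy =
  | "LLAMA3"
  | "META"
  | "METAWBOS"
  | "A16Z"
  | "REPLICATE4BIT"
  | "REPLICATE4BITWNEWLINES";

function detectStrategy(model: string): Strategy {
  if (model.startsWith("llama-3")) return "LLAMA3"; // Llama 3 family
  if (model.includes("4bit")) return "REPLICATE4BIT"; // quantized Llama 2
  return "META"; // default Llama 2 format
}

console.log(detectStrategy("llama-3-8b-instruct")); // "LLAMA3"
```

Passing `chatStrategy` explicitly overrides whatever the provider would detect on its own.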
## Streaming

```typescript
const stream = await llm.chat({
  messages: [{ role: "user", content: "Write a story about space exploration" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.delta);
}
```
## Custom API Token

```typescript
import { ReplicateLLM, ReplicateSession } from "@llamaindex/replicate";

const session = new ReplicateSession("your-api-token");

const llm = new ReplicateLLM({
  model: "llama-3-70b-instruct",
  replicateSession: session,
});
```
## With LlamaIndex

```typescript
import { Settings, VectorStoreIndex } from "llamaindex";
import { ReplicateLLM } from "@llamaindex/replicate";

Settings.llm = new ReplicateLLM({
  model: "llama-3-70b-instruct",
  temperature: 0.1,
});

const index = await VectorStoreIndex.fromDocuments(documents);
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query({
  query: "Summarize the main points",
});
```
## Convenience Function

```typescript
import { replicate } from "@llamaindex/replicate";

const llm = replicate({
  model: "llama-3-8b-instruct",
  noWarn: true,
});
```
## Configuration

### Environment Variables

```bash
REPLICATE_API_TOKEN=your-api-token-here
```

### Global Settings

```typescript
import { Settings } from "llamaindex";
import { ReplicateLLM } from "@llamaindex/replicate";

Settings.llm = new ReplicateLLM({
  model: "llama-3-70b-instruct",
});
```
### Metadata

Access model information:

```typescript
const llm = new ReplicateLLM({ model: "llama-3-70b-instruct" });

console.log(llm.metadata);
// {
//   model: "llama-3-70b-instruct",
//   temperature: 0.1,
//   topP: 0.9,
//   maxTokens: 8192,
//   contextWindow: 8192,
//   tokenizer: undefined,
//   structuredOutput: false
// }
```
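Because `maxTokens` defaults to the model's full context window, it can be worth budgeting room for the prompt yourself. The sketch below uses a crude 4-characters-per-token heuristic; `safeMaxTokens` and the heuristic are illustrative assumptions, not part of the package:

```typescript
// Rough sketch: budget maxTokens so prompt + response fit in the context
// window. The 4-chars-per-token estimate is a crude heuristic, not a real
// tokenizer; use an actual tokenizer for precise budgeting.
function safeMaxTokens(contextWindow: number, prompt: string, reserve = 64): number {
  const promptTokens = Math.ceil(prompt.length / 4);
  return Math.max(1, contextWindow - promptTokens - reserve);
}

// e.g. an 8K-context Llama 3 model with a 400-character prompt
console.log(safeMaxTokens(8192, "a".repeat(400))); // 8028
```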
## System Messages

System messages are handled differently based on the chat strategy:

```typescript
// Llama 3 format
const response = await llm.chat({
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Hello!" },
  ],
});
```
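For Llama 2-style strategies, the system message is folded into the prompt itself rather than carried as a separate field. A minimal sketch of the `<<SYS>>` wrapping, matching the template shown elsewhere in this document (illustrative only; the provider's real formatter may differ in whitespace and tokens):

```typescript
// Sketch of folding a system message into the Llama 2 [INST] format.
// Illustrative only, not the provider's actual code.
function formatLlama2(system: string, user: string): string {
  return `<s>[INST] <<SYS>>${system}<</SYS>> ${user} [/INST]`;
}

console.log(formatLlama2("You are a helpful assistant.", "Hello!"));
// <s>[INST] <<SYS>>You are a helpful assistant.<</SYS>> Hello! [/INST]
```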
## Model Selection Guide

| Use Case | Recommended Model | Why |
|---|---|---|
| Best quality | `llama-3-70b-instruct` | Latest, most capable |
| Balanced | `llama-3-8b-instruct` | Good quality, faster |
| Memory constrained | `Llama-2-7b-chat-4bit` | Quantized, smaller |
| Legacy systems | `Llama-2-70b-chat-old` | Stable, well-tested |
Performance considerations:

- Cold starts: the first request may take longer
- Streaming: better UX for long responses
- 4-bit models: faster but slightly lower quality
- Context window: Llama 3 has 8K, Llama 2 has 4K
## Error Handling

```typescript
try {
  const response = await llm.chat({ messages });
} catch (error) {
  if (error.message.includes("REPLICATE_API_TOKEN")) {
    console.error("API token not set");
  } else {
    console.error("API error:", error.message);
  }
}
```
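Cold starts and transient failures also make a retry wrapper useful around `llm.chat()`. The helper below is a generic sketch; `withRetry` is a hypothetical name, not part of `@llamaindex/replicate`:

```typescript
// Generic retry with exponential backoff, sketched for wrapping calls
// such as llm.chat(). Hypothetical helper, not part of the package.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseMs = 500,
): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i < attempts - 1) {
        // back off: 500ms, 1000ms, 2000ms, ...
        await new Promise((r) => setTimeout(r, baseMs * 2 ** i));
      }
    }
  }
  throw lastErr;
}

// Usage: const response = await withRetry(() => llm.chat({ messages }));
```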
## Legacy Aliases

For backwards compatibility:

```typescript
import { LlamaDeuce, DeuceChatStrategy } from "@llamaindex/replicate";

// LlamaDeuce is an alias for ReplicateLLM
const llm = new LlamaDeuce({ model: "llama-3-70b-instruct" });
```
## Prompt Formatting

The provider handles complex prompt formatting automatically:

```typescript
// Llama 3 format: input messages are converted to
// <|begin_of_text|><|start_header_id|>user<|end_header_id|>
// message<|eot_id|><|start_header_id|>assistant<|end_header_id|>

// Llama 2 format: input messages are converted to
// <s>[INST] <<SYS>>system<</SYS>> message [/INST]
```
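As a concrete illustration of the Llama 3 template above, here is a pure-function sketch; it is not the provider's actual code, and token placement may differ in detail:

```typescript
// Sketch of the Llama 3 chat template shown above. Illustrative only;
// the provider's real formatter may differ in whitespace and details.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function formatLlama3(messages: ChatMessage[]): string {
  let prompt = "<|begin_of_text|>";
  for (const m of messages) {
    prompt += `<|start_header_id|>${m.role}<|end_header_id|>\n\n${m.content}<|eot_id|>`;
  }
  // Leave an open assistant header so the model continues from there
  return prompt + "<|start_header_id|>assistant<|end_header_id|>\n\n";
}

console.log(formatLlama3([{ role: "user", content: "Hello!" }]));
```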
## Best Practices

- Use Llama 3 for new projects: better quality and an 8K context window
- Set an appropriate `maxTokens`: the default can be high; adjust it for your use case
- Handle cold starts: the first request takes longer, so consider warming up
- Use streaming: better UX for chat applications
- Choose the right model size: balance quality against cost and speed
- Leverage system messages: format instructions properly
## Pricing

Replicate charges per second of compute time. Check Replicate pricing for current rates.

## See Also