
Overview

Fireworks AI provides fast inference for open-source LLMs and embedding models. The provider extends the OpenAI-compatible client interface, pointing it at Fireworks AI's API endpoints.

Installation

npm install @llamaindex/fireworks

Basic Usage

LLM

import { FireworksLLM } from "@llamaindex/fireworks";

const llm = new FireworksLLM({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  apiKey: process.env.FIREWORKS_API_KEY
});

const response = await llm.chat({
  messages: [
    { role: "user", content: "Explain quantum computing" }
  ]
});

console.log(response.message.content);

Embeddings

import { FireworksEmbedding } from "@llamaindex/fireworks";

const embedModel = new FireworksEmbedding({
  model: "nomic-ai/nomic-embed-text-v1.5",
  apiKey: process.env.FIREWORKS_API_KEY
});

const embedding = await embedModel.getTextEmbedding(
  "LlamaIndex is a data framework for LLM applications"
);
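
Embeddings are usually compared with cosine similarity downstream. A minimal helper (plain TypeScript, not part of the package) looks like:

```typescript
// Cosine similarity between two equal-length embedding vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Scores near 1 indicate near-identical meaning; the helper works for vectors of any length, including the 768-dimensional nomic-embed output.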

Constructor Options

FireworksLLM

  • model (string, default: "accounts/fireworks/models/mixtral-8x7b-instruct"): Fireworks AI model name
  • apiKey (string): Fireworks API key (defaults to the FIREWORKS_API_KEY environment variable)
  • temperature (number): Sampling temperature
  • maxTokens (number): Maximum tokens in the response
  • topP (number): Nucleus sampling parameter
  • additionalSessionOptions (object): Additional OpenAI client options (e.g., a custom baseURL)

FireworksEmbedding

  • model (string, default: "nomic-ai/nomic-embed-text-v1.5"): Fireworks AI embedding model name
  • apiKey (string): Fireworks API key (defaults to the FIREWORKS_API_KEY environment variable)
  • additionalSessionOptions (object): Additional OpenAI client options
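
Putting the options together, a fully configured constructor call might look like the following; the values are illustrative, not recommendations:

```typescript
// Illustrative option values; tune for your workload
const llmOptions = {
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  apiKey: process.env.FIREWORKS_API_KEY, // same fallback the constructor uses
  temperature: 0.2, // lower = more deterministic
  maxTokens: 1024,  // cap on generated tokens
  topP: 0.9,        // nucleus sampling cutoff
  additionalSessionOptions: {
    baseURL: "https://api.fireworks.ai/inference/v1", // the default endpoint
  },
};
// const llm = new FireworksLLM(llmOptions);
```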

Supported Models

Chat Models

Llama 3.1

  • accounts/fireworks/models/llama-v3p1-405b-instruct: 405B, most capable
  • accounts/fireworks/models/llama-v3p1-70b-instruct: 70B, balanced
  • accounts/fireworks/models/llama-v3p1-8b-instruct: 8B, fast

Llama 3

  • accounts/fireworks/models/llama-v3-70b-instruct
  • accounts/fireworks/models/llama-v3-8b-instruct

Mixtral

  • accounts/fireworks/models/mixtral-8x7b-instruct: Default model
  • accounts/fireworks/models/mixtral-8x22b-instruct

Qwen

  • accounts/fireworks/models/qwen2p5-72b-instruct
  • accounts/fireworks/models/qwen2p5-7b-instruct

DeepSeek

  • accounts/fireworks/models/deepseek-v3

Embedding Models

  • nomic-ai/nomic-embed-text-v1.5: Default, 768 dimensions
  • nomic-ai/nomic-embed-text-v1: 768 dimensions
  • WhereIsAI/UAE-Large-V1: 1024 dimensions
  • thenlper/gte-large: 1024 dimensions
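
Vector stores often need the embedding dimension at setup time; the list above can be kept as a lookup table (plain data, not an API of the package):

```typescript
// Embedding dimensions per model, per the list above
const embeddingDimensions: Record<string, number> = {
  "nomic-ai/nomic-embed-text-v1.5": 768,
  "nomic-ai/nomic-embed-text-v1": 768,
  "WhereIsAI/UAE-Large-V1": 1024,
  "thenlper/gte-large": 1024,
};
```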

Streaming

const stream = await llm.chat({
  messages: [{ role: "user", content: "Write a story about AI" }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.delta);
}
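
If you also need the full text after streaming, a small accumulator works with any async iterable of { delta } chunks (the helper name is ours, not part of the package):

```typescript
// Accumulate streamed deltas into the final response text,
// optionally forwarding each delta to a callback as it arrives
async function collectStream(
  stream: AsyncIterable<{ delta: string }>,
  onDelta?: (delta: string) => void
): Promise<string> {
  let text = "";
  for await (const chunk of stream) {
    onDelta?.(chunk.delta);
    text += chunk.delta;
  }
  return text;
}
```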

Function Calling

Fireworks AI supports function calling on compatible models:

import { tool } from "@llamaindex/core/tools";
import { z } from "zod";

const weatherTool = tool({
  name: "get_weather",
  description: "Get current weather",
  parameters: z.object({
    location: z.string(),
    units: z.enum(["celsius", "fahrenheit"]).optional()
  }),
  execute: async ({ location, units = "celsius" }) => {
    return `Weather in ${location}: 22°${units === "celsius" ? "C" : "F"}`;
  }
});

const llm = new FireworksLLM({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct"
});

const response = await llm.chat({
  messages: [{ role: "user", content: "What's the weather in Tokyo?" }],
  tools: [weatherTool]
});
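
The execute handler above is ordinary code, so its logic can be unit-tested without any model call. Extracted as a plain function (our refactor, purely illustrative):

```typescript
// Same formatting logic as the weatherTool execute handler
function formatWeather(
  location: string,
  units: "celsius" | "fahrenheit" = "celsius"
): string {
  return `Weather in ${location}: 22°${units === "celsius" ? "C" : "F"}`;
}
```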

Structured Output

Pass a Zod schema as responseFormat to get typed, validated output:

import { z } from "zod";

const schema = z.object({
  name: z.string(),
  age: z.number(),
  interests: z.array(z.string())
});

const result = await llm.exec({
  messages: [{ role: "user", content: "Extract info: John is 30 and likes coding, hiking" }],
  responseFormat: schema
});

With LlamaIndex

import { Settings, VectorStoreIndex } from "llamaindex";
import { FireworksLLM, FireworksEmbedding } from "@llamaindex/fireworks";

Settings.llm = new FireworksLLM({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct"
});

Settings.embedModel = new FireworksEmbedding({
  model: "nomic-ai/nomic-embed-text-v1.5"
});

// `documents` is an array of Document objects loaded earlier
const index = await VectorStoreIndex.fromDocuments(documents);
const queryEngine = index.asQueryEngine();

const response = await queryEngine.query({
  query: "What are the main features?"
});

Convenience Functions

The fireworks() factory is shorthand for constructing a FireworksLLM:

import { fireworks } from "@llamaindex/fireworks";

const llm = fireworks({
  model: "accounts/fireworks/models/llama-v3p1-8b-instruct"
});

Configuration

Environment Variables

FIREWORKS_API_KEY=fw_...
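
Failing fast when the key is missing gives a clearer error than a rejected request later. A tiny guard (our helper, not part of the package):

```typescript
// Read a required environment variable or fail with a clear message
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// const apiKey = requireEnv("FIREWORKS_API_KEY");
```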

Custom Base URL

const llm = new FireworksLLM({
  additionalSessionOptions: {
    baseURL: "https://custom-fireworks-endpoint.com/inference/v1"
  }
});

The default base URL is https://api.fireworks.ai/inference/v1.

Global Settings

import { Settings } from "llamaindex";
import { FireworksLLM } from "@llamaindex/fireworks";

Settings.llm = new FireworksLLM({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct"
});

Model Selection Guide

  • Best quality: llama-v3p1-405b-instruct (most capable)
  • Balanced: llama-v3p1-70b-instruct (good quality, fast)
  • Speed critical: llama-v3p1-8b-instruct (fastest)
  • MoE architecture: mixtral-8x22b-instruct (efficient, capable)
  • Embeddings: nomic-embed-text-v1.5 (high quality, latest)
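
The guide can be encoded as a lookup so the rest of the app names intent rather than model IDs (the keys are our invention):

```typescript
// Map intent to the full Fireworks model IDs from the guide above
const modelForUseCase: Record<string, string> = {
  bestQuality: "accounts/fireworks/models/llama-v3p1-405b-instruct",
  balanced: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  speedCritical: "accounts/fireworks/models/llama-v3p1-8b-instruct",
  moe: "accounts/fireworks/models/mixtral-8x22b-instruct",
};
```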

Performance

Fireworks AI optimizes for low latency:
  • Fast inference: Optimized model serving
  • Batch processing: Efficient for high throughput
  • Streaming: Real-time token generation
  • Global deployment: Low latency worldwide

A quick way to measure round-trip latency:

const startTime = Date.now();

const response = await llm.chat({
  messages: [{ role: "user", content: "Explain AI" }]
});

const duration = Date.now() - startTime;
console.log(`Response time: ${duration}ms`);

Error Handling

try {
  const response = await llm.chat({ messages });
} catch (error) {
  if (error.message.includes("FIREWORKS_API_KEY")) {
    console.error("API key not set or invalid");
  } else if (error.status === 429) {
    console.error("Rate limit exceeded");
  } else {
    console.error("API error:", error.message);
  }
}

Rate Limits

Fireworks AI enforces rate limits that vary by plan. On a 429, back off and retry:

async function chatWithRetry(messages, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await llm.chat({ messages });
    } catch (error) {
      if (error.status !== 429 || attempt === maxRetries - 1) throw error;
      // Exponential backoff: 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
    }
  }
}

Best Practices

  1. Choose right model: Balance quality vs. speed and cost
  2. Use streaming: Better UX for chat applications
  3. Enable function calling: For structured interactions
  4. Monitor performance: Track latency and costs
  5. Set appropriate tokens: Control response length
  6. Use embeddings: nomic-embed-text-v1.5 for RAG applications

Example: RAG Application

import { VectorStoreIndex, Settings } from "llamaindex";
import { FireworksLLM, FireworksEmbedding } from "@llamaindex/fireworks";

// Configure both LLM and embeddings
Settings.llm = new FireworksLLM({
  model: "accounts/fireworks/models/llama-v3p1-70b-instruct",
  temperature: 0.1
});

Settings.embedModel = new FireworksEmbedding({
  model: "nomic-ai/nomic-embed-text-v1.5"
});

// Build index from previously loaded `documents`
const index = await VectorStoreIndex.fromDocuments(documents);

// Query
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query({
  query: "What are the key insights?"
});

Pricing

Fireworks AI offers competitive pricing for open-source models. Check Fireworks AI pricing for current rates.
