Cloudflare Workers AI provides free access to various open-source models running on Cloudflare’s edge network.

Overview

Cloudflare Workers AI offers serverless inference for multiple open-source models, running on Cloudflare’s global network for low latency worldwide.

Rate Limits

Free allocation: 10,000 neurons per day (see Cloudflare’s pricing page for details).

Neurons are Cloudflare’s unit of measurement for AI workloads; different models and operations consume different amounts of neurons.
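Because each model consumes neurons at its own rate, it helps to budget requests against the daily allocation. A minimal sketch, where the per-request neuron cost is a hypothetical number you would look up on the pricing page for your model:

```python
# Free-tier budgeting sketch. The per-request cost below is
# illustrative only; actual neuron costs vary by model and operation.
DAILY_FREE_NEURONS = 10_000

def requests_per_day(neurons_per_request: float) -> int:
    """Number of whole requests the free daily allocation covers."""
    return int(DAILY_FREE_NEURONS // neurons_per_request)

# Example: if one chat completion cost ~25 neurons (hypothetical),
# the free tier would cover 400 such requests per day.
print(requests_per_day(25))
```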

Available Models

Cloudflare Workers AI offers a wide variety of models:
  • @cf/openai/gpt-oss-120b - Open-source GPT model
  • @cf/qwen/qwen3-30b-a3b-fp8 - Qwen 3 model
  • Llama 4 Scout Instruct - Meta’s Llama 4 Scout model
  • Llama 3.3 70B Instruct (FP8) - Optimized Llama 3.3
  • Gemma 3 12B Instruct - Google’s Gemma model
  • Mistral Small 3.1 24B Instruct - Mistral’s efficient model

DeepSeek Models

  • DeepSeek R1 Distill Qwen 32B
  • DeepSeek Coder 6.7B Base (AWQ)
  • DeepSeek Coder 6.7B Instruct (AWQ)
  • DeepSeek Math 7B Instruct

Llama Models

  • Llama 2 7B Chat (FP16, INT8, LoRA)
  • Llama 2 13B Chat (AWQ)
  • Llama 3 8B Instruct (AWQ)
  • Llama 3.1 8B Instruct (AWQ, FP8)
  • Llama 3.2 1B, 3B, 11B Vision Instruct
  • Llama 3.3 70B Instruct (FP8)
  • Llama 4 Scout Instruct
  • Llama Guard 3 8B

Mistral Models

  • Mistral 7B Instruct v0.1 (AWQ)
  • Mistral 7B Instruct v0.2 (LoRA)
  • Mistral Small 3.1 24B Instruct
  • Hermes 2 Pro Mistral 7B

Qwen Models

  • Qwen 1.5 (0.5B, 1.8B, 7B, 14B)
  • Qwen 2.5 Coder 32B Instruct
  • Qwen QwQ 32B

Gemma Models

  • Gemma 2B Instruct (LoRA)
  • Gemma 3 12B Instruct
  • Gemma 7B Instruct (LoRA)

Other Models

  • @cf/aisingapore/gemma-sea-lion-v4-27b-it
  • @cf/ibm-granite/granite-4.0-h-micro
  • @cf/zai-org/glm-4.7-flash
  • Discolm German 7B v1 (AWQ)
  • Falcon 7B Instruct
  • Neural Chat 7B v3.1 (AWQ)
  • OpenChat 3.5 0106
  • OpenHermes 2.5 Mistral 7B (AWQ)
  • Phi-2
  • SQLCoder 7B 2
  • Starling LM 7B Beta
  • TinyLlama 1.1B Chat v1.0
  • Una Cybertron 7B v2 (BF16)
  • Zephyr 7B Beta (AWQ)
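The catalog above is a snapshot; the current list can also be fetched programmatically. A sketch assuming the `/ai/models/search` endpoint from Cloudflare’s API docs (the response shape, a `result` array of objects with a `name` field, is an assumption to verify against the API reference):

```python
import requests

API_BASE = "https://api.cloudflare.com/client/v4"

def models_search_url(account_id: str) -> str:
    """Build the model-catalog search URL (path assumed from the API docs)."""
    return f"{API_BASE}/accounts/{account_id}/ai/models/search"

def list_models(account_id: str, token: str) -> list:
    """Return the model names visible to this account."""
    resp = requests.get(
        models_search_url(account_id),
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    # Assumed response shape: {"result": [{"name": "@cf/..."}, ...]}
    return [m["name"] for m in resp.json()["result"]]
```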

API Usage

import requests

# Replace YOUR_ACCOUNT_ID with the account ID shown in your Cloudflare dashboard.
url = "https://api.cloudflare.com/client/v4/accounts/YOUR_ACCOUNT_ID/ai/run/@cf/meta/llama-3.3-70b-instruct-fp8"

# The token must have Workers AI permissions.
headers = {
    "Authorization": "Bearer YOUR_CLOUDFLARE_API_TOKEN",
    "Content-Type": "application/json"
}

# Chat-style payload: a list of role/content messages.
data = {
    "messages": [
        {"role": "user", "content": "Hello, how are you?"}
    ]
}

response = requests.post(url, headers=headers, json=data)
response.raise_for_status()  # surface HTTP errors instead of parsing an error body
print(response.json())
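For chat interfaces you will usually want tokens as they are generated rather than one final JSON body. A streaming sketch, assuming the server-sent-events format described in Cloudflare’s streaming docs (each line is `data: {"response": "..."}`, terminated by a `data: [DONE]` sentinel; verify the exact shape against the docs):

```python
import json
import requests

def parse_sse_line(line: bytes):
    """Extract the text chunk from one SSE line; return None for blank
    lines and the final [DONE] sentinel (format assumed from the docs)."""
    if not line.startswith(b"data: ") or line == b"data: [DONE]":
        return None
    return json.loads(line[len(b"data: "):]).get("response", "")

def stream_chat(url: str, token: str, prompt: str):
    """Yield response text incrementally by setting `stream: true`."""
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {token}"},
        json={"messages": [{"role": "user", "content": prompt}],
              "stream": True},
        stream=True,   # tell requests not to buffer the whole body
        timeout=60,
    )
    resp.raise_for_status()
    for line in resp.iter_lines():
        chunk = parse_sse_line(line)
        if chunk is not None:
            yield chunk
```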

Getting Started

  1. Create a Cloudflare account: sign up at cloudflare.com.
  2. Get an API token: create a token with Workers AI permissions.
  3. Find your account ID: copy it from the Cloudflare dashboard.
  4. Start building: use the REST API to run inference.
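Once you have an account ID and token, running inference reduces to building the run URL and posting a chat payload. A minimal sketch; reading credentials from `CLOUDFLARE_ACCOUNT_ID` and `CLOUDFLARE_API_TOKEN` environment variables is an illustrative convention, not something the platform requires:

```python
import os
import requests

def run_url(account_id: str, model: str) -> str:
    """Build the Workers AI inference endpoint for a given model ID."""
    return (
        "https://api.cloudflare.com/client/v4/accounts/"
        f"{account_id}/ai/run/{model}"
    )

def run_inference(prompt: str,
                  model: str = "@cf/meta/llama-3.3-70b-instruct-fp8"):
    """Post a single chat message and return the parsed JSON response.
    Expects credentials in environment variables (see note above)."""
    resp = requests.post(
        run_url(os.environ["CLOUDFLARE_ACCOUNT_ID"], model),
        headers={"Authorization":
                 f"Bearer {os.environ['CLOUDFLARE_API_TOKEN']}"},
        json={"messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```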

Key Features

  • Global Network: run inference on Cloudflare’s edge network
  • Low Latency: models run close to your users
  • Multiple Models: 50+ open-source models available
  • Optimized Variants: AWQ, LoRA, FP8, and INT8 quantized models
  • Serverless: no infrastructure to manage
  • Pay As You Go: free tier with neurons-based pricing

Model Optimizations

Cloudflare offers various optimized versions:
  • AWQ: Activation-aware Weight Quantization (4-bit)
  • FP8: 8-bit floating point
  • INT8: 8-bit integer quantization
  • LoRA: Low-Rank Adaptation for fine-tuning
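In the catalog above, the variant is typically encoded as a suffix of the model ID (e.g. `@cf/meta/llama-3.1-8b-instruct-awq`). A small sketch of picking that suffix out of an ID; the naming convention is inferred from the catalog, not a documented guarantee:

```python
# Quantization tags observed in Workers AI model IDs (see list above).
QUANT_SUFFIXES = ("awq", "fp8", "int8", "lora")

def quantization(model_id: str):
    """Return the quantization tag encoded in a model ID, or None
    if the ID has no recognized variant suffix."""
    tail = model_id.rsplit("-", 1)[-1].lower()
    return tail if tail in QUANT_SUFFIXES else None

print(quantization("@cf/meta/llama-3.1-8b-instruct-awq"))  # awq
print(quantization("@cf/meta/llama-3-8b-instruct"))        # None
```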

Use Cases

  • Edge AI: Run AI close to your users
  • Chatbots: Build conversational interfaces
  • Content Generation: Generate text content
  • Code Assistance: Code completion and generation
  • Translation: Multilingual applications
  • Global Applications: Low-latency worldwide

Additional Resources

  • Cloudflare Dashboard: manage your account
  • Documentation: official Workers AI documentation
  • Model Catalog: browse all available models
  • Pricing: pricing details
