Run a prediction HTTP server. This command builds the model and starts an HTTP server that exposes the model’s inputs and outputs as a REST API, compatible with the Cog HTTP protocol.

Usage

cog serve [flags]

Flags

-p, --port
integer
default:"8393"
Port on which to listen
cog serve -p 5000
-f, --file
string
default:"cog.yaml"
The name of the config file
cog serve -f custom-config.yaml
--gpus
string
GPU devices to add to the container, in the same format as docker run --gpus
cog serve --gpus all
--upload-url
string
Upload URL for file outputs (e.g., https://example.com/upload/). When specified, the server uploads file outputs to this URL instead of returning them directly.
cog serve --upload-url https://example.com/upload/
--progress
string
default:"auto"
Set type of build progress output: auto, tty, plain, or quiet
--use-cog-base-image
boolean
default:"true"
Use pre-built Cog base image for faster cold boots
--use-cuda-base-image
string
default:"auto"
Use Nvidia CUDA base image: true, false, or auto

Examples

Start the server on default port

cog serve
Output:
Building Docker image from environment in cog.yaml...

[+] Building 2.1s (12/12) FINISHED

Running 'python --check-hash-based-pycs never -m cog.server.http --await-explicit-shutdown true' in Docker with the current directory mounted as a volume...

Serving at http://127.0.0.1:8393

INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5000
The server is now running and ready to accept prediction requests.

Start on a custom port

cog serve -p 5000
Access the server at http://localhost:5000.

Test the server

Make a prediction request:
curl http://localhost:8393/predictions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"input": {"prompt": "a cat"}}'
Response:
{
  "id": "abc123",
  "status": "succeeded",
  "input": {
    "prompt": "a cat"
  },
  "output": "A fluffy orange tabby cat sitting on a windowsill",
  "logs": "",
  "error": null,
  "created_at": "2024-01-15T10:30:00.000Z",
  "started_at": "2024-01-15T10:30:00.100Z",
  "completed_at": "2024-01-15T10:30:02.500Z"
}

Send a file input

curl http://localhost:8393/predictions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"input": {"image": "https://example.com/photo.jpg"}}'
Or with a base64-encoded file:
curl http://localhost:8393/predictions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d @- <<EOF
{
  "input": {
  "image": "data:image/jpeg;base64,$(base64 < photo.jpg | tr -d '\n')"
  }
}
EOF
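The same data URL can be built in Python instead of shell; a minimal sketch (the to_data_url helper is illustrative, not part of Cog):

```python
import base64
import mimetypes


def to_data_url(path: str) -> str:
    """Encode a local file as a base64 data URL for the "input" payload."""
    mime, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{encoded}"


# Example: payload = {"input": {"image": to_data_url("photo.jpg")}}
```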

Check server health

curl http://localhost:8393/health-check
Response:
{"status": "READY"}

Get OpenAPI schema

curl http://localhost:8393/openapi.json
Returns the full OpenAPI specification for your model’s API.

API Endpoints

The server exposes these endpoints:

POST /predictions

Create a prediction. Request:
{
  "input": {
    "prompt": "a photo of a cat",
    "width": 1024,
    "height": 768
  }
}
Response:
{
  "id": "abc123",
  "status": "succeeded",
  "input": {...},
  "output": "...",
  "logs": "",
  "error": null,
  "created_at": "2024-01-15T10:30:00.000Z",
  "started_at": "2024-01-15T10:30:00.100Z",
  "completed_at": "2024-01-15T10:30:02.500Z"
}

GET /health-check

Check if the server is ready. Response:
{"status": "READY"}

GET /openapi.json

Get the OpenAPI schema. Response: Full OpenAPI 3.0 specification

POST /shutdown

Shutdown the server gracefully.

Input Types

The server handles various input types:

Strings

{"input": {"prompt": "hello world"}}

Numbers

{"input": {"width": 1024, "temperature": 0.8}}

Booleans

{"input": {"use_refiner": true}}

Files (URLs)

{"input": {"image": "https://example.com/photo.jpg"}}

Files (base64 data URLs)

{"input": {"image": "data:image/jpeg;base64,/9j/4AAQ..."}}

Arrays

{"input": {"images": [
  "https://example.com/photo1.jpg",
  "https://example.com/photo2.jpg"
]}}

Output Types

Strings

{"output": "Generated text"}

Numbers/Booleans

{"output": 42}
{"output": true}

Files

Returned as data URLs:
{"output": "data:image/png;base64,iVBORw0KGgo..."}
With --upload-url, files are uploaded and returned as URLs:
{"output": "https://example.com/outputs/abc123.png"}

Arrays

{"output": [
  "data:image/png;base64,iVBORw0KGgo...",
  "data:image/png;base64,iVBORw0KGgo..."
]}

Objects

{"output": {
  "result": "success",
  "score": 0.95
}}

Error Handling

When predictions fail, the response includes error details:
{
  "id": "abc123",
  "status": "failed",
  "error": "Invalid input: prompt must not be empty",
  "logs": "...",
  "created_at": "2024-01-15T10:30:00.000Z",
  "started_at": "2024-01-15T10:30:00.100Z",
  "completed_at": "2024-01-15T10:30:00.200Z"
}
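When calling the server from code, it is convenient to turn a failed prediction into an exception instead of inspecting the status field at every call site. A sketch (the PredictionError class is illustrative):

```python
class PredictionError(RuntimeError):
    """Raised when a prediction response reports a non-successful status."""


def check_prediction(prediction: dict) -> dict:
    """Return the prediction if it succeeded, else raise with the error detail."""
    if prediction.get("status") != "succeeded":
        raise PredictionError(
            f"prediction {prediction.get('id')} {prediction.get('status')}: "
            f"{prediction.get('error')}"
        )
    return prediction


# Example: result = check_prediction(response.json())
```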

File Upload Configuration

Default behavior

By default, file outputs are returned as base64 data URLs in the response:
cog serve

With upload URL

Files are uploaded to the specified URL:
cog serve --upload-url https://example.com/upload/
The server:
  1. Generates the output file
  2. POSTs it to the upload URL
  3. Returns the URL in the response
This is useful for:
  • Large files that exceed response size limits
  • External storage systems
  • CDN integration

GPU Configuration

Cog automatically detects GPU requirements:
# Auto-detected from cog.yaml
cog serve

# Explicitly use all GPUs
cog serve --gpus all

# Use specific GPUs
cog serve --gpus '"device=0,1"'

# Disable GPU
cog serve --gpus ""

Development Workflow

Local development

  1. Start the server:
    cog serve
    
  2. Make changes to your code
  3. Restart the server (Ctrl+C, then cog serve again)
  4. Test with curl or your application

Hot reloading

The current directory is mounted as a volume, so you can:
  • Edit Python files
  • Restart the server to pick up changes
  • No need to rebuild the Docker image

Integration Examples

Python client

import requests

response = requests.post(
    "http://localhost:8393/predictions",
    json={"input": {"prompt": "a cat"}}
)
print(response.json())

JavaScript client

const response = await fetch('http://localhost:8393/predictions', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({
    input: {prompt: 'a cat'}
  })
});
const prediction = await response.json();
console.log(prediction);

Using the Replicate client

The local server implements the Cog HTTP protocol rather than the Replicate API, so the Replicate client libraries cannot be pointed at it directly. Call the local server with a plain HTTP client (as in the Python and JavaScript examples above), or use cog predict from the command line.

Logs

Server logs include:
  • Request details
  • Prediction timing
  • Your model’s print statements
  • Error traces
View logs in the terminal where you ran cog serve.

Shutdown

Gracefully shutdown the server:
# From the terminal
Ctrl+C
Or programmatically:
curl -X POST http://localhost:8393/shutdown

How It Works

  1. Build phase:
    • Reads cog.yaml
    • Builds a Docker image
    • Mounts current directory as a volume
  2. Server startup:
    • Starts the Cog HTTP server (Python, running cog.server.http under Uvicorn)
    • Runs your model’s setup() method
    • Begins listening on specified port
  3. Prediction handling:
    • Receives HTTP requests
    • Validates inputs against schema
    • Runs your predict() method
    • Returns formatted output
  4. Shutdown:
    • Handles graceful shutdown
    • Cleans up resources

Performance

The Cog HTTP server is a Python application built on FastAPI and Uvicorn:
  • Low-latency, asynchronous request handling
  • Schema-based input validation
  • Streaming support for models that yield output incrementally
