Server Setup
First, build your model into a Docker image with `cog build`.

Making Predictions
To run a prediction on the model, call the `/predictions` endpoint:
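A minimal sketch, assuming the image was built with the tag `my-model` and the server is listening on its default port 5000. The input keys depend on your model's `predict()` signature; `prompt` here is a placeholder:

```shell
# Start the model server (image tag is a placeholder)
docker run -p 5000:5000 my-model

# Create a prediction synchronously; the response contains the output
curl http://localhost:5000/predictions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"input": {"prompt": "hello"}}'
```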
Synchronous vs Asynchronous Predictions
The server supports both synchronous and asynchronous prediction creation:
- Synchronous: The server waits until the prediction is completed and responds with the result.
- Asynchronous: The server immediately returns a response and processes the prediction in the background.
To create a prediction asynchronously, clients can include a `Prefer: respond-async` header in their request. When provided, the server responds immediately after starting the prediction with a `202 Accepted` status and a prediction object whose status is `processing`.
The only supported way to receive updates on the status of predictions started asynchronously is using webhooks. Polling for prediction status is not currently supported.
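A sketch of an asynchronous request, assuming a `my-model` server on port 5000; the webhook URL is a placeholder that the server will POST status updates to:

```shell
# Create a prediction asynchronously; the server replies 202 Accepted
# immediately and delivers status updates to the webhook URL
curl http://localhost:5000/predictions \
  -X POST \
  -H 'Content-Type: application/json' \
  -H 'Prefer: respond-async' \
  -d '{"input": {"prompt": "hello"}, "webhook": "https://example.com/webhook"}'
```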
Idempotent Predictions
You can use certain server endpoints to create predictions idempotently, such that if a client calls the endpoint more than once with the same ID (for example, due to a network interruption) while the prediction is still running, no new prediction is created. Instead, the client receives a `202 Accepted` response with the initial state of the prediction.
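For example (the prediction ID `abcd1234` is a client-generated placeholder; retrying this exact request is safe):

```shell
# PUT with a client-supplied prediction ID; repeating this request
# while the prediction is running does not create a duplicate
curl http://localhost:5000/predictions/abcd1234 \
  -X PUT \
  -H 'Content-Type: application/json' \
  -d '{"input": {"prompt": "hello"}}'
```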
Endpoint Summary
Here’s a summary of the prediction creation endpoints:

| Endpoint | Header | Behavior |
|---|---|---|
| `POST /predictions` | - | Synchronous, non-idempotent |
| `POST /predictions` | `Prefer: respond-async` | Asynchronous, non-idempotent |
| `PUT /predictions/<prediction_id>` | - | Synchronous, idempotent |
| `PUT /predictions/<prediction_id>` | `Prefer: respond-async` | Asynchronous, idempotent |
- Use synchronous endpoints when you want to wait for the prediction result.
- Use asynchronous endpoints when you want to start a prediction and receive updates via webhooks.
- Use idempotent endpoints when you need to safely retry requests without creating duplicate predictions.
Server Options
Cog Docker images have `python -m cog.server.http` set as the default command. When using command-line options, you need to pass in the full command before the options.
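For example (the image tag and option value are placeholders):

```shell
# The full default command must precede any options
docker run -p 5000:5000 my-model \
  python -m cog.server.http --threads=4
```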
--threads
This controls how many threads are used by Cog, which determines how many requests Cog serves in parallel. If your model uses a CPU, this is the number of CPUs on your machine. If your model uses a GPU, this is 1, because typically a GPU can only be used by one process at a time. You might need to adjust this if you want to control how much memory your model uses, or other similar constraints.

--host
By default, Cog serves on `0.0.0.0`. You can override this using the `--host` option.
For example, to serve Cog on an IPv6 address:
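A sketch, assuming the image is tagged `my-model` (`::` is the IPv6 unspecified address, i.e. listen on all IPv6 interfaces):

```shell
# Bind the server to an IPv6 address instead of the 0.0.0.0 default
docker run -p 5000:5000 my-model \
  python -m cog.server.http --host "::"
```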
Health Check Endpoint
GET /health-check
Returns the current health status of the model container. This endpoint always responds with `200 OK`; check the `status` field in the response body to determine readiness.
The `status` field is one of the following values:

- `STARTING`: The model’s `setup()` method is still running.
- `READY`: The model is ready to accept predictions.
- `BUSY`: The model is ready but all prediction slots are in use.
- `SETUP_FAILED`: The model’s `setup()` method raised an exception.
- `DEFUNCT`: The model encountered an unrecoverable error.
- `UNHEALTHY`: The model is ready but a user-defined `healthcheck()` method returned `False`.
The response body also includes:

- Setup phase details (included once setup has started).
- Runtime version information.
- Error message from a user-defined `healthcheck()` method (if applicable).
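An illustrative response body for a model that has finished setup; the `status` field is as documented above, while the other field names and timestamps here are assumptions and may vary by Cog version:

```json
{
  "status": "READY",
  "setup": {
    "status": "succeeded",
    "started_at": "2024-01-01T00:00:00Z",
    "completed_at": "2024-01-01T00:00:10Z",
    "logs": ""
  }
}
```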