When you run a Docker image built by Cog, it serves an HTTP API for making predictions on your model.

Server Setup

First, build your model:
cog build -t my-model
Then, start the Docker container:
# If your model uses a CPU:
docker run -d -p 5001:5000 my-model

# If your model uses a GPU:
docker run -d -p 5001:5000 --gpus all my-model

# If you're on an M1 Mac:
docker run -d -p 5001:5000 --platform=linux/amd64 my-model
The server is now running locally on port 5001.

Making Predictions

To run a prediction on the model, call the /predictions endpoint:
curl http://localhost:5001/predictions -X POST \
    --header "Content-Type: application/json" \
    --data '{"input": {"image": "https://.../input.jpg"}}'
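If you are calling the API from Python rather than curl, a minimal client can be sketched with only the standard library. The helper names here are illustrative, not part of Cog itself:

```python
import json
import urllib.request

def build_prediction_request(base_url, model_input):
    """Build a synchronous POST request for the /predictions endpoint."""
    body = json.dumps({"input": model_input}).encode()
    return urllib.request.Request(
        base_url + "/predictions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def create_prediction(base_url, model_input):
    """Send the request and block until the prediction result comes back."""
    with urllib.request.urlopen(build_prediction_request(base_url, model_input)) as resp:
        return json.loads(resp.read())

# Example (requires the container from above to be running):
# result = create_prediction("http://localhost:5001", {"image": "https://.../input.jpg"})
```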

Synchronous vs Asynchronous Predictions

The server supports both synchronous and asynchronous prediction creation:
  • Synchronous: The server waits until the prediction is completed and responds with the result.
  • Asynchronous: The server immediately returns a response and processes the prediction in the background.
The client can create a prediction asynchronously by setting the Prefer: respond-async header on the request. When this header is provided, the server responds immediately after starting the prediction with a 202 Accepted status and a prediction object whose status is processing.
The only supported way to receive updates on the status of predictions started asynchronously is using webhooks. Polling for prediction status is not currently supported.
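As a sketch of the asynchronous flow, again with the Python standard library: this assumes the request body accepts a webhook field telling the server where to deliver status updates, and the receiver URL below is a placeholder — check your Cog version's schema for the exact field:

```python
import json
import urllib.request

def build_async_request(base_url, model_input, webhook_url):
    """Build a prediction request the server answers immediately with 202."""
    # Assumption: the "webhook" field names the URL that receives status
    # updates; the receiver address passed in is a placeholder.
    body = json.dumps({"input": model_input, "webhook": webhook_url}).encode()
    return urllib.request.Request(
        base_url + "/predictions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Prefer": "respond-async",  # ask the server to reply without waiting
        },
        method="POST",
    )
```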

Idempotent Predictions

The PUT /predictions/&lt;prediction_id&gt; endpoints create predictions idempotently: if a client calls the endpoint more than once with the same ID (for example, due to a network interruption) while the prediction is still running, no new prediction is created. Instead, the client receives a 202 Accepted response with the initial state of the prediction.
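A minimal sketch of an idempotent request in Python; the ID is client-generated, and the helper name is illustrative:

```python
import json
import urllib.request
import uuid

def build_idempotent_request(base_url, model_input, prediction_id=None):
    """Build a PUT request that is safe to retry with the same prediction ID."""
    prediction_id = prediction_id or uuid.uuid4().hex  # client-generated ID
    body = json.dumps({"input": model_input}).encode()
    request = urllib.request.Request(
        f"{base_url}/predictions/{prediction_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return prediction_id, request
```

Keeping hold of the ID lets a client resend the identical request after a network failure without creating a second prediction.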

Endpoint Summary

Here’s a summary of the prediction creation endpoints:
Endpoint                          Header                 Behavior
POST /predictions                 -                      Synchronous, non-idempotent
POST /predictions                 Prefer: respond-async  Asynchronous, non-idempotent
PUT /predictions/<prediction_id>  -                      Synchronous, idempotent
PUT /predictions/<prediction_id>  Prefer: respond-async  Asynchronous, idempotent
Choose the endpoint that best fits your needs:
  • Use synchronous endpoints when you want to wait for the prediction result.
  • Use asynchronous endpoints when you want to start a prediction and receive updates via webhooks.
  • Use idempotent endpoints when you need to safely retry requests without creating duplicate predictions.
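Putting those choices together, a retry loop against the idempotent PUT endpoint might look like this sketch (standard library only; the backoff policy and helper name are illustrative):

```python
import json
import time
import urllib.error
import urllib.request
import uuid

def predict_with_retries(base_url, model_input, attempts=3):
    """Retry the same idempotent PUT on network errors; reusing one ID
    means at most one prediction is ever created."""
    prediction_id = uuid.uuid4().hex  # shared across every attempt
    body = json.dumps({"input": model_input}).encode()
    for attempt in range(attempts):
        try:
            req = urllib.request.Request(
                f"{base_url}/predictions/{prediction_id}",
                data=body,
                headers={"Content-Type": "application/json"},
                method="PUT",
            )
            with urllib.request.urlopen(req) as resp:
                return json.loads(resp.read())
        except urllib.error.URLError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error
            time.sleep(2 ** attempt)  # simple exponential backoff
```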

Server Options

Cog Docker images have python -m cog.server.http set as the default command. When using command-line options, you need to pass in the full command before the options.

--threads

This controls how many threads Cog uses, which determines how many requests Cog serves in parallel. If your model uses a CPU, this defaults to the number of CPUs on your machine. If your model uses a GPU, it defaults to 1, because a GPU can typically only be used by one process at a time. You might need to adjust this to control how much memory your model uses, or to satisfy similar constraints.
docker run -d -p 5000:5000 my-model python -m cog.server.http --threads=10

--host

By default, Cog listens on 0.0.0.0. You can override this with the --host option. For example, to serve Cog on an IPv6 address:
docker run -d -p 5000:5000 my-model python -m cog.server.http --host="::"

Health Check Endpoint

GET /health-check

Returns the current health status of the model container. This endpoint always responds with 200 OK — check the status field in the response body to determine readiness.
curl http://localhost:5001/health-check
Response:
{
    "status": "READY",
    "setup": {
        "started_at": "2025-01-01T00:00:00.000000+00:00",
        "completed_at": "2025-01-01T00:00:05.000000+00:00",
        "status": "succeeded",
        "logs": ""
    },
    "version": {
        "coglet": "0.17.0",
        "cog": "0.14.0",
        "python": "3.12.0"
    }
}
status (string, required)
One of the following values:
  • STARTING: The model’s setup() method is still running.
  • READY: The model is ready to accept predictions.
  • BUSY: The model is ready but all prediction slots are in use.
  • SETUP_FAILED: The model’s setup() method raised an exception.
  • DEFUNCT: The model encountered an unrecoverable error.
  • UNHEALTHY: The model is ready but a user-defined healthcheck() method returned False.
setup (object)
Setup phase details (included once setup has started).

version (object)
Runtime version information.

user_healthcheck_error (string)
Error message from a user-defined healthcheck() method (if applicable).
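The status values above suggest a simple readiness gate before sending traffic. Here is a sketch using the standard library; the function names, timeout, and polling interval are illustrative:

```python
import json
import time
import urllib.request

FATAL_STATUSES = {"SETUP_FAILED", "DEFUNCT"}

def is_ready(health):
    """True once the container reports it can accept predictions."""
    return health.get("status") == "READY"

def is_fatal(health):
    """True when waiting longer cannot help."""
    return health.get("status") in FATAL_STATUSES

def wait_until_ready(base_url, timeout=300.0, interval=1.0):
    """Poll /health-check until READY, a fatal status, or the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with urllib.request.urlopen(base_url + "/health-check") as resp:
            health = json.loads(resp.read())
        if is_ready(health):
            return health
        if is_fatal(health):
            raise RuntimeError(f"model cannot become ready: {health['status']}")
        time.sleep(interval)
    raise TimeoutError("model did not report READY before the timeout")
```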

OpenAPI Schema

GET /openapi.json

The OpenAPI specification of the API, which is derived from the input and output types specified in your model’s Predictor and Training objects.
curl http://localhost:5001/openapi.json
You can also view this in your browser at localhost:5001/openapi.json.
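Because the schema describes your model's inputs, one common use is reading them programmatically. This sketch assumes the input schema is registered at components/schemas/Input, which is where Cog-generated specs typically place it — verify against your own /openapi.json:

```python
import json
import urllib.request

def extract_input_schema(spec):
    """Pull the prediction input schema out of a parsed OpenAPI document."""
    # Assumption: Cog registers the input model under components/schemas/Input.
    return spec["components"]["schemas"]["Input"]

def fetch_input_schema(base_url):
    """Fetch /openapi.json and return just the input schema."""
    with urllib.request.urlopen(base_url + "/openapi.json") as resp:
        return extract_input_schema(json.loads(resp.read()))
```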
