Server Setup
First, build your model into a Docker image with `cog build`.

Making Predictions
To run a prediction on the model, call the `/predictions` endpoint:
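A minimal sketch, assuming the image was built with the tag `my-model` and the server is listening on its default port 5000. The input keys depend on your model's `predict()` signature; `prompt` here is a placeholder:

```shell
# Start the model server (image tag is a placeholder)
docker run -p 5000:5000 my-model

# Create a prediction synchronously; the response contains the output
curl http://localhost:5000/predictions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"input": {"prompt": "hello"}}'
```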
Synchronous vs Asynchronous Predictions
The server supports both synchronous and asynchronous prediction creation:
- Synchronous: The server waits until the prediction is completed and responds with the result.
- Asynchronous: The server immediately returns a response and processes the prediction in the background.
To create a prediction asynchronously, clients can include a `Prefer: respond-async` header in their request. When provided, the server responds immediately after starting the prediction with a `202 Accepted` status and a prediction object whose status is `processing`.
The only supported way to receive updates on the status of predictions started asynchronously is using webhooks. Polling for prediction status is not currently supported.
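A sketch of an asynchronous request, assuming a `my-model` server on port 5000; the webhook URL is a placeholder that the server will POST status updates to:

```shell
# Create a prediction asynchronously; the server replies 202 Accepted
# immediately and delivers status updates to the webhook URL
curl http://localhost:5000/predictions \
  -X POST \
  -H 'Content-Type: application/json' \
  -H 'Prefer: respond-async' \
  -d '{"input": {"prompt": "hello"}, "webhook": "https://example.com/webhook"}'
```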
Idempotent Predictions
You can use certain server endpoints to create predictions idempotently, such that if a client calls the endpoint more than once with the same ID (for example, due to a network interruption) while the prediction is still running, no new prediction is created. Instead, the client receives a `202 Accepted` response with the initial state of the prediction.
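For example (the prediction ID `abcd1234` is a client-generated placeholder; retrying this exact request is safe):

```shell
# PUT with a client-supplied prediction ID; repeating this request
# while the prediction is running does not create a duplicate
curl http://localhost:5000/predictions/abcd1234 \
  -X PUT \
  -H 'Content-Type: application/json' \
  -d '{"input": {"prompt": "hello"}}'
```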
Endpoint Summary
Here’s a summary of the prediction creation endpoints:

| Endpoint | Header | Behavior |
|---|---|---|
| `POST /predictions` | - | Synchronous, non-idempotent |
| `POST /predictions` | `Prefer: respond-async` | Asynchronous, non-idempotent |
| `PUT /predictions/<prediction_id>` | - | Synchronous, idempotent |
| `PUT /predictions/<prediction_id>` | `Prefer: respond-async` | Asynchronous, idempotent |
- Use synchronous endpoints when you want to wait for the prediction result.
- Use asynchronous endpoints when you want to start a prediction and receive updates via webhooks.
- Use idempotent endpoints when you need to safely retry requests without creating duplicate predictions.
Server Options
Cog Docker images have `python -m cog.server.http` set as the default command. When using command-line options, you need to pass in the full command before the options.
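For example (the image tag and option value are placeholders):

```shell
# The full default command must precede any options
docker run -p 5000:5000 my-model \
  python -m cog.server.http --threads=4
```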
--threads
This controls how many threads are used by Cog, which determines how many requests Cog serves in parallel. If your model uses a CPU, this is the number of CPUs on your machine. If your model uses a GPU, this is 1, because typically a GPU can only be used by one process at a time. You might need to adjust this if you want to control how much memory your model uses, or other similar constraints.

--host
By default, Cog serves on `0.0.0.0`. You can override this using the `--host` option.
For example, to serve Cog on an IPv6 address:
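A sketch, assuming the image is tagged `my-model` (`::` is the IPv6 unspecified address, i.e. listen on all IPv6 interfaces):

```shell
# Bind the server to an IPv6 address instead of the 0.0.0.0 default
docker run -p 5000:5000 my-model \
  python -m cog.server.http --host "::"
```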
Health Check Endpoint
GET /health-check
Returns the current health status of the model container. This endpoint always responds with `200 OK`; check the `status` field in the response body to determine readiness.
The `status` field is one of the following values:

- `STARTING`: The model’s `setup()` method is still running.
- `READY`: The model is ready to accept predictions.
- `BUSY`: The model is ready but all prediction slots are in use.
- `SETUP_FAILED`: The model’s `setup()` method raised an exception.
- `DEFUNCT`: The model encountered an unrecoverable error.
- `UNHEALTHY`: The model is ready but a user-defined `healthcheck()` method returned `False`.
The response body also includes:

- Setup phase details (included once setup has started).
- Runtime version information.
- Error message from a user-defined `healthcheck()` method (if applicable).
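An illustrative response body for a model that has finished setup; the `status` field is as documented above, while the other field names and timestamps here are assumptions and may vary by Cog version:

```json
{
  "status": "READY",
  "setup": {
    "status": "succeeded",
    "started_at": "2024-01-01T00:00:00Z",
    "completed_at": "2024-01-01T00:00:10Z",
    "logs": ""
  }
}
```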