Run a prediction HTTP server. Builds the model and starts an HTTP server that exposes the model’s inputs and outputs as a REST API. Compatible with the Cog HTTP protocol.
Usage
cog serve [flags]
Flags
-f
The name of the config file. Example: cog serve -f custom-config.yaml
--gpus
GPU devices to add to the container, in the same format as docker run --gpus
--upload-url
Upload URL for file outputs (e.g., https://example.com/upload/). When specified, the server uploads file outputs to this URL instead of returning them directly. Example: cog serve --upload-url https://example.com/upload/
Set type of build progress output: auto, tty, plain, or quiet
Use pre-built Cog base image for faster cold boots
Use Nvidia CUDA base image: true, false, or auto
Examples
Start the server on default port
cog serve
Output:
Building Docker image from environment in cog.yaml...
[+] Building 2.1s (12/12) FINISHED
Running 'python --check-hash-based-pycs never -m cog.server.http --await-explicit-shutdown true' in Docker with the current directory mounted as a volume...
Serving at http://127.0.0.1:8393
INFO: Started server process
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:5000
The server is now running and ready to accept prediction requests.
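If you are scripting against the server, it can help to wait for this ready state rather than racing setup(). A minimal sketch using only the Python standard library (the port 8393 and the health-check path are taken from the examples in this doc; this checks reachability only, not the response body):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url, timeout=60.0, interval=1.0):
    """Poll GET {base_url}/health-check until it responds with 200, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health-check", timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval)
    return False
```

Usage: wait_until_ready("http://localhost:8393") before issuing the first prediction request.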
Start on a custom port
Access the server at http://localhost:5000.
Test the server
Make a prediction request:
curl http://localhost:8393/predictions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"input": {"prompt": "a cat"}}'
Response:
{
  "id": "abc123",
  "status": "succeeded",
  "input": {
    "prompt": "a cat"
  },
  "output": "A fluffy orange tabby cat sitting on a windowsill",
  "logs": "",
  "error": null,
  "created_at": "2024-01-15T10:30:00.000Z",
  "started_at": "2024-01-15T10:30:00.100Z",
  "completed_at": "2024-01-15T10:30:02.500Z"
}
With a file input passed as a URL:
curl http://localhost:8393/predictions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"input": {"image": "https://example.com/photo.jpg"}}'
Or with a base64-encoded file:
curl http://localhost:8393/predictions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d @- <<EOF
{
  "input": {
    "image": "data:image/jpeg;base64,$(base64 -i photo.jpg)"
  }
}
EOF
Check server health
curl http://localhost:8393/health-check
Response:
Get OpenAPI schema
curl http://localhost:8393/openapi.json
Returns the full OpenAPI specification for your model’s API.
API Endpoints
The server exposes these endpoints:
POST /predictions
Create a prediction.
Request:
{
  "input": {
    "prompt": "a photo of a cat",
    "width": 1024,
    "height": 768
  }
}
Response:
{
  "id": "abc123",
  "status": "succeeded",
  "input": {...},
  "output": "...",
  "logs": "",
  "error": null,
  "created_at": "2024-01-15T10:30:00.000Z",
  "started_at": "2024-01-15T10:30:00.100Z",
  "completed_at": "2024-01-15T10:30:02.500Z"
}
GET /health-check
Check if the server is ready.
Response:
GET /openapi.json
Get the OpenAPI schema.
Response: Full OpenAPI 3.0 specification
POST /shutdown
Shut down the server gracefully.
Input Types
The server handles various input types:
Strings
{"input": {"prompt": "hello world"}}
Numbers
{"input": {"width": 1024, "temperature": 0.8}}
Booleans
{"input": {"use_refiner": true}}
Files (URLs)
{"input": {"image": "https://example.com/photo.jpg"}}
Files (base64 data URLs)
{"input": {"image": "data:image/jpeg;base64,/9j/4AAQ..."}}
Arrays
{"input": {"images": [
  "https://example.com/photo1.jpg",
  "https://example.com/photo2.jpg"
]}}
Output Types
Strings
{"output": "Generated text"}
Numbers/Booleans
{"output": 42}
{"output": true}
Files
Returned as data URLs:
{"output": "data:image/png;base64,iVBORw0KGgo..."}
With --upload-url, files are uploaded and returned as URLs:
{"output": "https://example.com/outputs/abc123.png"}
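On the client side, a data-URL output can be decoded back into raw bytes; a sketch (decode_data_url is a hypothetical helper, not part of Cog):

```python
import base64

def decode_data_url(data_url):
    """Split a base64 data URL into (mime_type, raw_bytes)."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError(f"not a base64 data URL: {header!r}")
    mime = header[len("data:"):-len(";base64")]
    return mime, base64.b64decode(payload)
```

For example, the bytes can then be written to disk with the extension implied by the MIME type.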
Arrays
{"output": [
  "data:image/png;base64,iVBORw0KGgo...",
  "data:image/png;base64,iVBORw0KGgo..."
]}
Objects
{"output": {
  "result": "success",
  "score": 0.95
}}
Error Handling
When predictions fail, the response includes error details:
{
  "id": "abc123",
  "status": "failed",
  "error": "Invalid input: prompt must not be empty",
  "logs": "...",
  "created_at": "2024-01-15T10:30:00.000Z",
  "started_at": "2024-01-15T10:30:00.100Z",
  "completed_at": "2024-01-15T10:30:00.200Z"
}
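A client can branch on the status field before touching the output; a hypothetical helper:

```python
class PredictionError(RuntimeError):
    """Raised when a prediction response reports a non-success status."""

def unwrap(prediction):
    """Return the output of a prediction response dict, raising on failure."""
    status = prediction.get("status")
    if status == "succeeded":
        return prediction.get("output")
    raise PredictionError(
        f"prediction {prediction.get('id')} {status}: {prediction.get('error')}"
    )
```

Usage: output = unwrap(response.json()) after a POST to /predictions.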
File Upload Configuration
Default behavior
By default, file outputs are returned as base64 data URLs in the response:
{"output": "data:image/png;base64,..."}
With upload URL
Files are uploaded to the specified URL:
cog serve --upload-url https://example.com/upload/
The server:
- Generates the output file
- POSTs it to the upload URL
- Returns the URL in the response
This is useful for:
- Large files that exceed response size limits
- External storage systems
- CDN integration
GPU Configuration
Cog automatically detects GPU requirements:
# Auto-detected from cog.yaml
cog serve
# Explicitly use all GPUs
cog serve --gpus all
# Use specific GPUs
cog serve --gpus '"device=0,1"'
# Disable GPU
cog serve --gpus ""
Development Workflow
Local development
- Start the server with cog serve
- Make changes to your code
- Restart the server (Ctrl+C, then cog serve again)
- Test with curl or your application
Hot reloading
The current directory is mounted as a volume, so you can:
- Edit Python files
- Restart the server to pick up changes
- No need to rebuild the Docker image
Integration Examples
Python client
import requests

response = requests.post(
    "http://localhost:8393/predictions",
    json={"input": {"prompt": "a cat"}}
)
print(response.json())
JavaScript client
const response = await fetch('http://localhost:8393/predictions', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({
    input: {prompt: 'a cat'}
  })
});
const prediction = await response.json();
console.log(prediction);
Using the Replicate client
import replicate

# Point to local server
client = replicate.Client(api_url="http://localhost:8393")

output = client.run(
    "local",
    input={"prompt": "a cat"}
)
Logs
Server logs include:
- Request details
- Prediction timing
- Your model’s print statements
- Error traces
View logs in the terminal where you ran cog serve.
Shutdown
Gracefully shut down the server:
# From the terminal
Ctrl+C
Or programmatically:
curl -X POST http://localhost:8393/shutdown
How It Works
- Build phase:
  - Reads cog.yaml
  - Builds a Docker image
  - Mounts the current directory as a volume
- Server startup:
  - Starts the Cog HTTP server (Rust/Axum)
  - Runs your model’s setup() method
  - Begins listening on the specified port
- Prediction handling:
  - Receives HTTP requests
  - Validates inputs against the schema
  - Runs your predict() method
  - Returns formatted output
- Shutdown:
  - Handles graceful shutdown
  - Cleans up resources
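The schema-validation step above can be illustrated offline. Given the Input schema from /openapi.json, this sketch lists required inputs missing from a request body (the components.schemas.Input location follows Cog's generated schema; treat it as an assumption, and the field names below are hypothetical):

```python
def missing_inputs(openapi_schema, payload):
    """Return the names of required input fields absent from a request payload."""
    input_schema = openapi_schema["components"]["schemas"]["Input"]
    required = input_schema.get("required", [])
    supplied = payload.get("input", {})
    return [name for name in required if name not in supplied]
```

A request missing a required field would be rejected by the server before predict() ever runs.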
The Cog HTTP server is built with Rust for high performance:
- Low latency request handling
- Efficient memory usage
- Automatic request queuing
- WebSocket support for streaming
See Also