Incident Ticket Format

Ticket Structure

Each incident ticket is a JSON file stored in the incidents/ directory with a standardized structure. All fields provide critical information for investigating and resolving issues.

Complete Example

Here’s a complete incident ticket showing all fields:

{
  "id": "INC-001",
  "title": "Server returns 500 error on POST /api/auth/login after latest deployment",
  "severity": "P1 - Critical",
  "service": "python-service",
  "reported_by": "Frontend Team",
  "environment": "staging",
  "timestamp": "2026-02-28T14:23:00Z",
  "description": "After the latest deployment to staging, users are unable to log in. The POST /api/auth/login endpoint returns a 500 Internal Server Error for all login attempts, even with valid credentials. This was working fine before the last merge to main.",
  "steps_to_reproduce": [
    "Send a POST request to /api/auth/login with valid credentials",
    "Payload: {\"email\": \"[email protected]\", \"password\": \"correctpassword\"}",
    "Receive 500 Internal Server Error instead of 200 with JWT token"
  ],
  "error_log": "TypeError: a bytes-like object is required, not 'str' in app/routes/auth.py line 42",
  "expected_behavior": "Returns 200 with JWT token for valid credentials, 401 for invalid credentials",
  "actual_behavior": "Returns 500 Internal Server Error for all requests to /api/auth/login",
  "recent_changes": "Updated bcrypt library from 3.2.0 to 4.1.2 and refactored auth routes",
  "tags": ["runtime-crash", "authentication", "blocking"]
}
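Because tickets are plain JSON files, they can be read with the standard library alone. The following is a minimal sketch, not part of the project itself: the `load_tickets` helper is hypothetical, and it assumes every ticket file in the directory ends in `.json`.

```python
import json
from pathlib import Path


def load_tickets(directory="incidents"):
    """Load every *.json ticket in `directory`, keyed by its "id" field.

    Hypothetical helper: the .json file extension is an assumption,
    the ticket format itself only requires valid JSON with an "id".
    """
    tickets = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            ticket = json.load(handle)
        tickets[ticket["id"]] = ticket
    return tickets
```

Calling `load_tickets()` from the repository root would then return a dictionary such as `{"INC-001": {...}}`, one entry per ticket file.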

Field Reference

id

Type: String (required)
Format: INC-XXX where XXX is a zero-padded number
Description: Unique identifier for the incident. Use the numbering convention:

INC-001 to INC-099: Python service incidents
INC-101 to INC-199: Node.js service incidents

Example:

"id": "INC-001"
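The numbering convention can be checked mechanically. The sketch below is illustrative only: `service_for_id` is a hypothetical helper, and it assumes the numeric part is always exactly three digits, which matches the examples in this document but is not stated as a hard rule.

```python
import re

# Assumption: exactly three digits after the "INC-" prefix.
ID_PATTERN = re.compile(r"INC-(\d{3})")


def service_for_id(ticket_id):
    """Return the service implied by a ticket id, or None if the
    number falls outside the two documented ranges."""
    match = ID_PATTERN.fullmatch(ticket_id)
    if match is None:
        raise ValueError(f"not a valid incident id: {ticket_id!r}")
    number = int(match.group(1))
    if 1 <= number <= 99:
        return "python-service"
    if 101 <= number <= 199:
        return "node-service"
    return None
```

A check like this can catch a ticket whose `id` range disagrees with its `service` field before the ticket is filed.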

title

Type: String (required)
Description: Concise summary of the issue. Should include:

The affected endpoint or feature
The symptom or error type
Context about when it occurs

Best Practices:

Be specific (include endpoint names, error codes)
Keep under 100 characters
Front-load the most important information

Good Examples:

"title": "Server returns 500 error on POST /api/auth/login after latest deployment"
"title": "Node.js server crashes with unhandled promise rejection on GET /api/users/:id"
"title": "/api/orders endpoint response time increased from 200ms to 8s"

Bad Examples:

"title": "Login broken"  // Too vague
"title": "Error"  // No context
"title": "There's a problem with the authentication system that started happening after we deployed the latest changes to production"  // Too verbose

severity

Type: String (required)
Description: Priority level indicating urgency and impact.

Valid Values:

"P0 - Critical" - Complete service outage or data loss
"P1 - Critical" - Core functionality broken, blocking users
"P2 - High" - Major feature degraded, workaround available
"P3 - Medium" - Minor issue, limited user impact

Example:

"severity": "P1 - Critical"
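Since the severity values are ordered from most to least urgent, they can drive triage order. The following sketch assumes tickets have already been loaded as dictionaries; `severity_rank` and `sort_by_urgency` are hypothetical names introduced here for illustration.

```python
# The four documented values, most urgent first.
VALID_SEVERITIES = (
    "P0 - Critical",
    "P1 - Critical",
    "P2 - High",
    "P3 - Medium",
)


def severity_rank(severity):
    """Return 0 for the most urgent severity, 3 for the least urgent."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    return VALID_SEVERITIES.index(severity)


def sort_by_urgency(tickets):
    """Return tickets ordered from most to least urgent (stable sort)."""
    return sorted(tickets, key=lambda ticket: severity_rank(ticket["severity"]))
```

Matching on the full string rather than only the `P0` to `P3` prefix also rejects near-misses such as `"P1 - High"`.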

service

Type: String (required)
Description: Which microservice is affected by this incident.

Valid Values:

"python-service" - Python Flask API
"node-service" - Node.js Express API

Example:

"service": "python-service"

reported_by

Type: String (required)
Description: Team or individual who reported the incident.

Common Values:

"Frontend Team"
"Backend Team"
"DevOps Team"
"Platform Team"
"Customer Support"
"QA Team"

Example:

"reported_by": "Frontend Team"

environment

Type: String (required)
Description: Environment where the issue occurs.

Valid Values:

"development" - Local development
"staging" - Staging environment
"production" - Production environment

Example:

"environment": "staging"

timestamp

Type: String (required)
Format: ISO 8601 datetime with timezone (UTC)
Description: When the incident was first reported or observed.

Example:

"timestamp": "2026-02-28T14:23:00Z"
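Timestamps in this format can be parsed and produced with Python's standard library. This is a minimal sketch with hypothetical helper names; it normalizes the trailing `Z` by hand because `datetime.fromisoformat` only accepts that suffix from Python 3.11 onward.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a ticket timestamp into a timezone-aware UTC datetime."""
    # Replace the "Z" suffix with an explicit offset for older Pythons.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        raise ValueError("timestamp must include a timezone")
    return parsed.astimezone(timezone.utc)


def format_timestamp(moment):
    """Format a timezone-aware datetime the way the tickets store it."""
    if moment.tzinfo is None:
        raise ValueError("datetime must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Rejecting naive datetimes in both directions avoids silently interpreting a timestamp in the local timezone of whoever runs the script.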

description

Type: String (required)
Description: Detailed explanation of the issue including:

What is broken
When it started occurring
User impact
Any relevant context

Best Practices:

Start with the user impact
Mention when the issue started
Include relevant context (deployment, config changes)
Be clear and concise

Example:

"description": "After the latest deployment to staging, users are unable to log in. The POST /api/auth/login endpoint returns a 500 Internal Server Error for all login attempts, even with valid credentials. This was working fine before the last merge to main."

steps_to_reproduce

Type: Array of strings (required)
Description: Step-by-step instructions to reproduce the issue. Should be detailed enough that anyone can follow them and see the same behavior.

Best Practices:

Number steps sequentially
Include exact API calls, payloads, or user actions
Specify what should happen at each step
Include sample data where relevant

Example:

"steps_to_reproduce": [
  "Send a POST request to /api/auth/login with valid credentials",
  "Payload: {\"email\": \"[email protected]\", \"password\": \"correctpassword\"}",
  "Receive 500 Internal Server Error instead of 200 with JWT token"
]

error_log

Type: String (required)
Description: Actual error messages, stack traces, or log output from the system. Include:

Error type and message
Stack trace (especially the top frames)
File names and line numbers
Relevant log entries

Best Practices:

Copy errors verbatim
Include stack traces
Note file paths and line numbers
If no error occurs, describe the symptom (e.g., “No errors — endpoint returns correct data but very slowly”)

Example (with error):

"error_log": "TypeError: a bytes-like object is required, not 'str' in app/routes/auth.py line 42"

Example (no error, performance issue):

"error_log": "No errors — endpoint returns correct data but very slowly. DB logs show hundreds of SELECT queries per single API call."

Example (with stack trace):

"error_log": "UnhandledPromiseRejectionWarning: Error: User not found\n    at UserService.getById (/app/src/services/userService.js:25:11)\n    at processTicksAndRejections (internal/process/task_queues.js:95:5)\nThis error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled."

expected_behavior

Type: String (required)
Description: What should happen when the system is working correctly. Be specific about:

Expected response codes
Expected response format
Expected timing/performance
Expected side effects

Examples:

"expected_behavior": "Returns 200 with JWT token for valid credentials, 401 for invalid credentials"

"expected_behavior": "Returns 404 JSON response: {\"error\": \"User not found\"} without crashing"

"expected_behavior": "Endpoint returns orders with items in under 500ms"

actual_behavior

Type: String (required)
Description: What actually happens when the issue occurs. Be specific about:

Actual response codes
Actual response format
Actual timing/performance
Actual side effects

Examples:

"actual_behavior": "Returns 500 Internal Server Error for all requests to /api/auth/login"

"actual_behavior": "Entire server process crashes on unhandled promise rejection"

"actual_behavior": "Endpoint takes 5-10 seconds and generates excessive database queries"

recent_changes

Type: String (required)
Description: What changed before the issue appeared. Include:

Code changes (commits, PRs, refactors)
Dependency updates
Configuration changes
Infrastructure changes
Deployments

Best Practices:

Be specific about what changed
Include version numbers for dependency updates
Mention relevant commits or PRs if known
If unknown, state “No recent changes known”

Examples:

"recent_changes": "Updated bcrypt library from 3.2.0 to 4.1.2 and refactored auth routes"

"recent_changes": "Migrated user routes from callbacks to async/await"

"recent_changes": "Infrastructure team updated environment variable naming convention across all services"

Creating a New Ticket

Use this template to create new incident tickets:

{
  "id": "INC-XXX",
  "title": "",
  "severity": "",
  "service": "",
  "reported_by": "",
  "environment": "",
  "timestamp": "",
  "description": "",
  "steps_to_reproduce": [
    ""
  ],
  "error_log": "",
  "expected_behavior": "",
  "actual_behavior": "",
  "recent_changes": "",
  "tags": []
}
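Filling in the template can also be scripted. The sketch below is one possible approach under stated assumptions: `create_ticket` is a hypothetical helper, and naming the file after the ticket id (for example `incidents/INC-003.json`) is a guess, since this page does not specify a file naming scheme.

```python
import copy
import json
from pathlib import Path

# Mirrors the blank template shown above.
TEMPLATE = {
    "id": "INC-XXX",
    "title": "",
    "severity": "",
    "service": "",
    "reported_by": "",
    "environment": "",
    "timestamp": "",
    "description": "",
    "steps_to_reproduce": [""],
    "error_log": "",
    "expected_behavior": "",
    "actual_behavior": "",
    "recent_changes": "",
    "tags": [],
}


def create_ticket(ticket_id, directory="incidents", **fields):
    """Write a new ticket file and return its path."""
    unknown = set(fields) - set(TEMPLATE)
    if unknown:
        raise KeyError(f"unknown ticket fields: {sorted(unknown)}")
    ticket = copy.deepcopy(TEMPLATE)  # avoid sharing the template's lists
    ticket.update(fields)
    ticket["id"] = ticket_id
    # Assumption: one file per ticket, named "<id>.json".
    path = Path(directory) / f"{ticket_id}.json"
    if path.exists():
        raise FileExistsError(f"{path} already exists")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(ticket, indent=2) + "\n", encoding="utf-8")
    return path
```

For example, `create_ticket("INC-003", title="...", severity="P2 - High")` would write a file whose remaining fields are still blank and ready to be filled in by hand.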

Real-World Examples

Example 1: Runtime Crash (Node.js)

{
  "id": "INC-101",
  "title": "Node.js server crashes with unhandled promise rejection on GET /api/users/:id",
  "severity": "P1 - Critical",
  "service": "node-service",
  "reported_by": "Frontend Team",
  "environment": "production",
  "timestamp": "2026-02-28T13:00:00Z",
  "description": "The Node.js service crashes intermittently when fetching user profiles. When a user ID that doesn't exist in the database is requested, the server process exits with an unhandled promise rejection instead of returning a 404 response. This causes downtime until the process restarts.",
  "steps_to_reproduce": [
    "Send GET /api/users/99999 (non-existent user ID)",
    "Server process crashes",
    "All concurrent requests fail until process restarts"
  ],
  "error_log": "UnhandledPromiseRejectionWarning: Error: User not found\n    at UserService.getById (/app/src/services/userService.js:25:11)\n    at processTicksAndRejections (internal/process/task_queues.js:95:5)\nThis error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled.",
  "expected_behavior": "Returns 404 JSON response: {\"error\": \"User not found\"} without crashing",
  "actual_behavior": "Entire server process crashes on unhandled promise rejection",
  "recent_changes": "Migrated user routes from callbacks to async/await",
  "tags": ["runtime-crash", "async-error", "process-exit"]
}

Example 2: Configuration Issue (Python)

{
  "id": "INC-002",
  "title": "App fails to connect to database in staging environment",
  "severity": "P1 - Critical",
  "service": "python-service",
  "reported_by": "DevOps Team",
  "environment": "staging",
  "timestamp": "2026-02-28T09:15:00Z",
  "description": "The Python service cannot connect to PostgreSQL in the staging environment. The application starts but immediately crashes when any endpoint requiring database access is hit. The database is confirmed running and accessible from other services. This issue only occurs in staging/production — local development works fine.",
  "steps_to_reproduce": [
    "Deploy python-service to staging environment",
    "Attempt to hit any database-backed endpoint (e.g., GET /api/products)",
    "Observe connection refused error"
  ],
  "error_log": "sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server: Connection refused\n\tIs the server running on host \"localhost\" and accepting TCP/IP connections on port 5432?",
  "expected_behavior": "Service connects to the staging PostgreSQL database configured via environment variables",
  "actual_behavior": "Service always tries to connect to localhost:5432 regardless of environment variables",
  "recent_changes": "Infrastructure team updated environment variable naming convention across all services",
  "tags": ["misconfiguration", "database", "environment"]
}

Example 3: Performance Issue (Python)

{
  "id": "INC-005",
  "title": "/api/orders endpoint response time increased from 200ms to 8s",
  "severity": "P2 - High",
  "service": "python-service",
  "reported_by": "Platform Team",
  "environment": "production",
  "timestamp": "2026-02-28T08:00:00Z",
  "description": "The GET /api/orders endpoint response time has degraded severely. Average response time went from ~200ms to over 8 seconds after the last release. The endpoint lists orders for a user including their order items. Database CPU utilization has spiked to 85% during peak hours.",
  "steps_to_reproduce": [
    "Create a user with 10+ orders, each having 3-5 items",
    "Send GET /api/orders with the user's auth token",
    "Observe response time > 5 seconds"
  ],
  "error_log": "No errors — endpoint returns correct data but very slowly. DB logs show hundreds of SELECT queries per single API call.",
  "expected_behavior": "Endpoint returns orders with items in under 500ms",
  "actual_behavior": "Endpoint takes 5-10 seconds and generates excessive database queries",
  "recent_changes": "Refactored order listing to include order items in the response payload",
  "tags": ["performance", "n+1-query", "database"]
}

Validation Checklist

Before submitting an incident ticket, verify:
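Several of the rules in the Field Reference can be checked automatically before submission. The following is a rough sketch rather than an official validator: `validate_ticket` is a hypothetical function, it treats the 100-character title guideline as a hard limit, and it does not check `tags` because that field has no reference entry on this page.

```python
import re

REQUIRED_STRING_FIELDS = (
    "id", "title", "severity", "service", "reported_by", "environment",
    "timestamp", "description", "error_log", "expected_behavior",
    "actual_behavior", "recent_changes",
)
ALLOWED_VALUES = {
    "severity": {"P0 - Critical", "P1 - Critical", "P2 - High", "P3 - Medium"},
    "service": {"python-service", "node-service"},
    "environment": {"development", "staging", "production"},
}


def validate_ticket(ticket):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for field in REQUIRED_STRING_FIELDS:
        value = ticket.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field}: missing or empty")
    steps = ticket.get("steps_to_reproduce")
    if (not isinstance(steps, list) or not steps
            or not all(isinstance(s, str) and s.strip() for s in steps)):
        problems.append("steps_to_reproduce: must be a non-empty list of non-empty strings")
    ticket_id = ticket.get("id")
    # Assumption: ids always carry exactly three digits.
    if isinstance(ticket_id, str) and not re.fullmatch(r"INC-\d{3}", ticket_id):
        problems.append("id: expected format INC-XXX with three digits")
    title = ticket.get("title")
    if isinstance(title, str) and len(title) >= 100:
        problems.append("title: should be under 100 characters")
    for field, allowed in ALLOWED_VALUES.items():
        value = ticket.get(field)
        if isinstance(value, str) and value.strip() and value not in allowed:
            problems.append(f"{field}: {value!r} is not a documented value")
    return problems
```

An empty list means the ticket passed these structural checks; whether the description and reproduction steps are actually useful still needs a human review.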

Next Steps

Review Getting Started for incident workflow and best practices
Browse existing incidents in the incidents/ directory for more examples
Set up templates or scripts to generate new tickets from this format
