Getting Started with Incident Tracking

Introduction

The ShopStack Platform incident tracking system uses structured JSON tickets to document, track, and resolve issues across all microservices. Each incident is stored as a JSON file in the incidents/ directory, providing a clear audit trail and standardized format for issue resolution.

Incident Numbering Convention

Incidents are assigned unique identifiers based on the service:

INC-001 to INC-099: Python service incidents
INC-101 to INC-199: Node.js service incidents

This convention makes it easy to identify which service an incident affects at a glance.
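
The ranges above can also be checked mechanically. The following is a minimal sketch, assuming IDs always follow the three-digit `INC-NNN` pattern shown here. The function name `service_for_incident` is hypothetical, and the service names are the ones used in the example tickets on this page.

```python
import re

# ID ranges from the numbering convention; service names as used in the example tickets.
SERVICE_RANGES = {
    "python-service": range(1, 100),    # INC-001 to INC-099
    "node-service": range(101, 200),    # INC-101 to INC-199
}


def service_for_incident(incident_id: str) -> str:
    """Return the service that owns an incident ID, or raise ValueError."""
    match = re.fullmatch(r"INC-(\d{3})", incident_id)
    if match is None:
        raise ValueError(f"not a valid incident ID: {incident_id!r}")
    number = int(match.group(1))
    for service, id_range in SERVICE_RANGES.items():
        if number in id_range:
            return service
    # Numbers such as 100 fall in neither documented range.
    raise ValueError(f"{incident_id} is outside every documented range")
```

With these ranges, `service_for_incident("INC-005")` returns `"python-service"`, and an ID that falls in neither range raises `ValueError`.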

Severity Levels

Incidents are classified by severity to prioritize resolution efforts:

Severity	Description	Response Time
P0 - Critical	Complete service outage, data loss	Immediate
P1 - Critical	Core functionality broken, blocking users	< 1 hour
P2 - High	Major feature degraded, workaround available	< 4 hours
P3 - Medium	Minor issue, limited user impact	< 24 hours
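
For scripts that need to act on the table, the response times can be expressed as durations. This is a sketch under two assumptions that the table does not state: "Immediate" is modeled as a zero-length window, and the clock starts when the incident is opened. `response_deadline` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Response windows from the severity table; "Immediate" is modeled as zero (an assumption).
RESPONSE_WINDOWS = {
    "P0 - Critical": timedelta(0),
    "P1 - Critical": timedelta(hours=1),
    "P2 - High": timedelta(hours=4),
    "P3 - Medium": timedelta(hours=24),
}


def response_deadline(severity: str, opened_at: datetime) -> datetime:
    """Return the latest time by which a response is expected."""
    try:
        return opened_at + RESPONSE_WINDOWS[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```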

Incident Workflow

1. Create an Incident Ticket

When an issue is discovered:

Create a new JSON file in incidents/ using the next available ID
Fill in all required fields (see Ticket Format)
Include detailed reproduction steps and error logs
Tag appropriately for categorization
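
The first two steps can be scripted. The sketch below rests on several assumptions: tickets are stored as `incidents/INC-NNN.json` (this page does not specify the file naming), "next available ID" means one more than the highest ID already used in the service's range, and only the five fields shown in the examples on this page are written. The full set of required fields is defined in Ticket Format. `next_incident_id` and `create_ticket` are hypothetical helpers.

```python
import json
import re
from pathlib import Path

# First and last ID per service, from the numbering convention.
ID_BOUNDS = {"python-service": (1, 99), "node-service": (101, 199)}


def next_incident_id(incidents_dir: Path, service: str) -> str:
    """Return the ID following the highest one already used for this service."""
    first, last = ID_BOUNDS[service]
    highest = first - 1
    for path in incidents_dir.glob("INC-*.json"):
        match = re.fullmatch(r"INC-(\d{3})", path.stem)
        if match and first <= int(match.group(1)) <= last:
            highest = max(highest, int(match.group(1)))
    if highest >= last:
        raise RuntimeError(f"no free incident IDs left for {service}")
    return f"INC-{highest + 1:03d}"


def create_ticket(incidents_dir: Path, service: str, title: str,
                  severity: str, tags: list[str]) -> Path:
    """Write a new ticket file and return its path."""
    incidents_dir.mkdir(parents=True, exist_ok=True)
    ticket = {
        "id": next_incident_id(incidents_dir, service),
        "title": title,
        "severity": severity,
        "service": service,
        "tags": tags,
    }
    path = incidents_dir / f"{ticket['id']}.json"
    # Mode "x" fails if the file already exists, so an existing ticket is never overwritten.
    with path.open("x", encoding="utf-8") as handle:
        json.dump(ticket, handle, indent=2)
        handle.write("\n")
    return path
```

Reproduction steps and error logs still need to be added by hand, using the field names given in Ticket Format.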

2. Investigation

Review the incident details and error logs
Reproduce the issue using the provided steps
Check recent changes that may have introduced the bug
Examine related code paths and dependencies
Document findings in the investigation notes

3. Resolution

Identify the root cause
Implement a fix with tests to prevent regression
Verify the fix resolves the issue in all environments
Update the incident ticket with resolution details

4. Post-Mortem (P0/P1 only)

For critical incidents:

Document timeline of events
Identify root cause and contributing factors
List action items to prevent recurrence
Share learnings with the team
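
The P0/P1 rule and the items listed above can be captured in a small helper. This is only a sketch: the key names in the skeleton (`timeline`, `root_cause`, `contributing_factors`, `action_items`) are hypothetical and are not part of the documented ticket format.

```python
def needs_post_mortem(severity: str) -> bool:
    """True for P0 and P1 incidents, matching the heading of this step."""
    return severity.split(" - ")[0] in {"P0", "P1"}


def post_mortem_skeleton(incident_id: str) -> dict:
    """Return an empty post-mortem with one slot per item listed above."""
    return {
        "incident": incident_id,
        "timeline": [],               # ordered events
        "root_cause": "",
        "contributing_factors": [],
        "action_items": [],           # steps to prevent recurrence
    }
```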

Common Issue Categories

Runtime Crashes

Issues that cause the service to crash or become unresponsive. Example: INC-101 - Unhandled promise rejection crashes Node.js server

{
  "id": "INC-101",
  "title": "Node.js server crashes with unhandled promise rejection on GET /api/users/:id",
  "severity": "P1 - Critical",
  "service": "node-service",
  "tags": ["runtime-crash", "async-error", "process-exit"]
}

Authentication Issues

Problems with user login, registration, or token validation. Example: INC-001 - Login endpoint returns 500 error

{
  "id": "INC-001",
  "title": "Server returns 500 error on POST /api/auth/login after latest deployment",
  "severity": "P1 - Critical",
  "service": "python-service",
  "tags": ["runtime-crash", "authentication", "blocking"]
}

Configuration Issues

Misconfigurations that prevent services from running properly. Example: INC-002 - Database connection failure in staging

{
  "id": "INC-002",
  "title": "App fails to connect to database in staging environment",
  "severity": "P1 - Critical",
  "service": "python-service",
  "tags": ["misconfiguration", "database", "environment"]
}

Performance Issues

Degradation in response times or resource consumption. Example: INC-005 - Slow orders endpoint

{
  "id": "INC-005",
  "title": "/api/orders endpoint response time increased from 200ms to 8s",
  "severity": "P2 - High",
  "service": "python-service",
  "tags": ["performance", "n+1-query", "database"]
}
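
Tickets like the four above can be checked against the conventions on this page before they are committed. The sketch below validates only what this page documents: the five fields shown in the examples, the four severity labels, and the ID range per service. `check_ticket` is a hypothetical helper, and the complete field list lives in Ticket Format.

```python
import re

SEVERITIES = {"P0 - Critical", "P1 - Critical", "P2 - High", "P3 - Medium"}
ID_BOUNDS = {"python-service": (1, 99), "node-service": (101, 199)}
EXAMPLE_FIELDS = ("id", "title", "severity", "service", "tags")


def check_ticket(ticket: dict) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = [f"missing field: {name}" for name in EXAMPLE_FIELDS if name not in ticket]
    if problems:
        return problems
    if ticket["severity"] not in SEVERITIES:
        problems.append(f"unknown severity: {ticket['severity']}")
    match = re.fullmatch(r"INC-(\d{3})", str(ticket["id"]))
    bounds = ID_BOUNDS.get(ticket["service"])
    if match is None:
        problems.append(f"malformed id: {ticket['id']}")
    elif bounds is None:
        problems.append(f"unknown service: {ticket['service']}")
    elif not bounds[0] <= int(match.group(1)) <= bounds[1]:
        problems.append(f"{ticket['id']} is outside the range for {ticket['service']}")
    if not isinstance(ticket["tags"], list) or not ticket["tags"]:
        problems.append("tags must be a non-empty list")
    return problems
```

Each of the four example tickets passes these checks. A ticket with `"id": "INC-105"` and `"service": "python-service"` is reported as out of range.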

Best Practices

Writing Good Incident Tickets

DO:

Be specific in the title (include endpoint, error type, or symptom)
Provide complete reproduction steps that anyone can follow
Include actual error messages and stack traces
Document what changed recently (deployment, config, dependency)
Use clear expected vs. actual behavior descriptions

DON’T:

Use vague titles like “Login broken” or “API not working”
Assume the reader knows the context
Skip reproduction steps
Omit error logs or stack traces
Forget to specify the environment

Investigation Tips

Start with recent changes - Most bugs are introduced by recent code or config changes
Check the error logs first - Error messages often point directly to the issue
Reproduce locally - Always verify you can reproduce before investigating
Look for patterns - Similar incidents may share root causes
Check dependencies - Library upgrades can introduce breaking changes

Resolution Guidelines

Write tests first - Add a failing test that reproduces the bug
Fix the root cause - Don’t just patch symptoms
Test in all environments - Verify the fix works in dev, staging, and production
Document the fix - Update the incident ticket with what was changed and why
Consider prevention - Add monitoring, validation, or safeguards to prevent recurrence

Using Incident Data

Identify Trends

Review incident tags to identify recurring issues:

# Count incidents by tag (assumes each "tags" array sits on a single line)
grep -r '"tags"' incidents/ | grep -o '"[^"]*"' | grep -vx '"tags"' | sort | uniq -c | sort -rn
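
The grep pipeline works line by line, so it misses tags when a `tags` array is spread over several lines. Parsing the files as JSON avoids that. A minimal sketch, assuming every `.json` file in `incidents/` is a single ticket object; `count_tags` is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def count_tags(incidents_dir: str = "incidents") -> Counter:
    """Count how many tickets carry each tag."""
    counts = Counter()
    for path in sorted(Path(incidents_dir).glob("*.json")):
        ticket = json.loads(path.read_text(encoding="utf-8"))
        # set() keeps a tag repeated within one ticket from being counted twice.
        counts.update(set(ticket.get("tags", [])))
    return counts
```

`count_tags().most_common()` returns `(tag, count)` pairs with the most frequent tag first.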

Track Service Health

Monitor incident counts and severity by service:

# List all P1 incidents
grep -l '"P1 - Critical"' incidents/*.json

# Count incidents by service
grep -h '"service"' incidents/*.json | sort | uniq -c
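
The two commands can be combined into one breakdown per service and severity. This is a sketch, assuming each file in `incidents/` is a single ticket object with the `service` and `severity` fields shown in the examples; `incidents_by_service` is a hypothetical helper.

```python
import json
from collections import Counter
from pathlib import Path


def incidents_by_service(incidents_dir: str = "incidents") -> Counter:
    """Count tickets per (service, severity) pair."""
    counts = Counter()
    for path in sorted(Path(incidents_dir).glob("*.json")):
        ticket = json.loads(path.read_text(encoding="utf-8"))
        # Tickets missing either field are grouped under "unknown" rather than skipped.
        key = (ticket.get("service", "unknown"), ticket.get("severity", "unknown"))
        counts[key] += 1
    return counts
```

For example, `incidents_by_service()[("python-service", "P1 - Critical")]` gives the number of P1 tickets filed against the Python service.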

Improve Quality

Use incident data to:

Identify areas needing more tests
Find code that changes frequently and breaks often
Prioritize technical debt reduction
Improve deployment and rollback processes

Next Steps

Review the Ticket Format documentation for detailed field descriptions
Browse existing incidents in incidents/ to see real examples
Set up monitoring to catch issues before users report them
Establish incident response runbooks for common scenarios
