Introduction
The ShopStack Platform incident tracking system uses structured JSON tickets to document, track, and resolve issues across all microservices. Each incident is stored as a JSON file in the incidents/ directory, providing a clear audit trail and standardized format for issue resolution.
Incident Numbering Convention
Incidents are assigned unique identifiers based on the service:
- INC-001 to INC-099: Python service incidents
- INC-101 to INC-199: Node.js service incidents
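The ranges above can be checked mechanically. The following is a minimal sketch, assuming IDs always use three digits; the function name `service_for_incident` and the service labels `"python"` and `"nodejs"` are illustrative and not part of the platform.

```python
import re

# Ranges copied from the numbering convention; the dictionary keys are
# illustrative labels, not identifiers defined by the platform.
SERVICE_RANGES = {
    "python": range(1, 100),    # INC-001 to INC-099
    "nodejs": range(101, 200),  # INC-101 to INC-199
}

# Assumption: incident IDs always have exactly three digits.
_ID_PATTERN = re.compile(r"INC-(\d{3})")


def service_for_incident(incident_id: str) -> str:
    """Return the service label whose range contains the incident ID."""
    match = _ID_PATTERN.fullmatch(incident_id)
    if match is None:
        raise ValueError(f"not a valid incident ID: {incident_id!r}")
    number = int(match.group(1))
    for service, id_range in SERVICE_RANGES.items():
        if number in id_range:
            return service
    raise ValueError(f"{incident_id} is outside every documented range")
```

IDs such as INC-000 and INC-100 fall outside both documented ranges, so this sketch rejects them rather than guessing a service.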
Severity Levels
Incidents are classified by severity to prioritize resolution efforts:

| Severity | Description | Response Time |
|---|---|---|
| P0 - Critical | Complete service outage, data loss | Immediate |
| P1 - Critical | Core functionality broken, blocking users | < 1 hour |
| P2 - High | Major feature degraded, workaround available | < 4 hours |
| P3 - Medium | Minor issue, limited user impact | < 24 hours |
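For tooling that needs these targets, the table can be encoded as data. This is a sketch under two assumptions: "Immediate" is modeled as a zero-length window, and the names `RESPONSE_TARGETS` and `response_deadline` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Response windows from the severity table. "Immediate" is modeled as a
# zero-length window, which is an assumption made for this sketch.
RESPONSE_TARGETS = {
    "P0": timedelta(0),
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=4),
    "P3": timedelta(hours=24),
}


def response_deadline(severity: str, reported_at: datetime) -> datetime:
    """Return the time by which a first response is expected."""
    return reported_at + RESPONSE_TARGETS[severity]


reported = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
deadline = response_deadline("P2", reported)  # four hours after the report
```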
Incident Workflow
1. Create an Incident Ticket
When an issue is discovered:
- Create a new JSON file in incidents/ using the next available ID
- Fill in all required fields (see Ticket Format)
- Include detailed reproduction steps and error logs
- Tag appropriately for categorization
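The creation step could be scripted roughly as follows. This is only a sketch: it assumes files are named `INC-NNN.json`, treats "next available" as the lowest unused number in the service's range, and uses placeholder field names. The authoritative field list is in the Ticket Format documentation.

```python
import json
import re
from pathlib import Path

# Inclusive ID ranges from the numbering convention; keys are illustrative.
ID_RANGES = {"python": (1, 99), "nodejs": (101, 199)}


def next_incident_id(incidents_dir: Path, service: str) -> str:
    """Return the lowest unused ID in the service's range.

    Assumes ticket files are named INC-NNN.json (a hypothetical convention).
    """
    low, high = ID_RANGES[service]
    used = set()
    for path in incidents_dir.glob("INC-*.json"):
        match = re.fullmatch(r"INC-(\d{3})", path.stem)
        if match:
            used.add(int(match.group(1)))
    for number in range(low, high + 1):
        if number not in used:
            return f"INC-{number:03d}"
    raise RuntimeError(f"no free incident IDs left for {service}")


def create_ticket(incidents_dir: Path, service: str, title: str, severity: str) -> Path:
    """Write a ticket skeleton and return its path."""
    incident_id = next_incident_id(incidents_dir, service)
    # Placeholder fields for illustration only; see Ticket Format for the
    # real required fields.
    ticket = {
        "id": incident_id,
        "title": title,
        "severity": severity,
        "environment": "",
        "reproduction_steps": [],
        "error_logs": [],
        "tags": [],
    }
    path = incidents_dir / f"{incident_id}.json"
    path.write_text(json.dumps(ticket, indent=2) + "\n", encoding="utf-8")
    return path
```

Picking the lowest unused number reuses the ID of any deleted ticket. If IDs must never be reused, take the highest existing number plus one instead.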
2. Investigation
- Review the incident details and error logs
- Reproduce the issue using the provided steps
- Check recent changes that may have introduced the bug
- Examine related code paths and dependencies
- Document findings in the investigation notes
3. Resolution
- Identify the root cause
- Implement a fix with tests to prevent regression
- Verify the fix resolves the issue in all environments
- Update the incident ticket with resolution details
4. Post-Mortem (P0/P1 only)
For critical incidents:
- Document timeline of events
- Identify root cause and contributing factors
- List action items to prevent recurrence
- Share learnings with the team
Common Issue Categories
Runtime Crashes
Issues that cause the service to crash or become unresponsive. Example: INC-101 - Unhandled promise rejection crashes Node.js server
Authentication Issues
Problems with user login, registration, or token validation. Example: INC-001 - Login endpoint returns 500 error
Configuration Issues
Misconfigurations that prevent services from running properly. Example: INC-002 - Database connection failure in staging
Performance Issues
Degradation in response times or resource consumption. Example: INC-005 - Slow orders endpoint
Best Practices
Writing Good Incident Tickets
DO:
- Be specific in the title (include endpoint, error type, or symptom)
- Provide complete reproduction steps that anyone can follow
- Include actual error messages and stack traces
- Document what changed recently (deployment, config, dependency)
- Use clear expected vs. actual behavior descriptions
DON'T:
- Use vague titles like “Login broken” or “API not working”
- Assume the reader knows the context
- Skip reproduction steps
- Omit error logs or stack traces
- Forget to specify the environment
Investigation Tips
- Start with recent changes - Most bugs are introduced by recent code or config changes
- Check the error logs first - Error messages often point directly to the issue
- Reproduce locally - Always verify you can reproduce before investigating
- Look for patterns - Similar incidents may share root causes
- Check dependencies - Library upgrades can introduce breaking changes
Resolution Guidelines
- Write tests first - Add a failing test that reproduces the bug
- Fix the root cause - Don’t just patch symptoms
- Test in all environments - Verify the fix works in dev, staging, and production
- Document the fix - Update the incident ticket with what was changed and why
- Consider prevention - Add monitoring, validation, or safeguards to prevent recurrence
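The "write tests first" guideline might look like the sketch below. The helper `parse_quantity` and the bug it describes are invented for illustration and do not correspond to a real incident. The regression test is written against the reported behavior first, fails on the old code, and passes once the root cause is fixed.

```python
import unittest


def parse_quantity(raw):
    """Hypothetical helper, shown after the fix.

    In this illustration, the pre-fix version called raw.strip() without
    guarding against None, so a missing value raised AttributeError.
    """
    if raw is None or raw.strip() == "":
        return None
    return int(raw)


class ParseQuantityRegressionTest(unittest.TestCase):
    def test_missing_quantity_returns_none(self):
        # Reproduces the reported failure: this raised before the fix.
        self.assertIsNone(parse_quantity(None))

    def test_blank_quantity_returns_none(self):
        self.assertIsNone(parse_quantity("   "))

    def test_valid_quantity_still_parses(self):
        # Guards against the fix breaking the normal path.
        self.assertEqual(parse_quantity("3"), 3)
```

Saved in a test module, this runs with `python -m unittest`. Keeping the test in the suite after the fix is what prevents the regression from returning unnoticed.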
Using Incident Data
Identify Trends
Review incident tags to identify recurring issues.
Track Service Health
Monitor incident counts and severity by service.
Improve Quality
Use incident data to:
- Identify areas needing more tests
- Find code that changes frequently and breaks often
- Prioritize technical debt reduction
- Improve deployment and rollback processes
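The tag and per-service reviews described in this section can be sketched with the standard library. The field names `tags`, `service`, and `severity` are assumptions about the ticket schema, as are the function names. Check them against the Ticket Format documentation before relying on this.

```python
import json
from collections import Counter
from pathlib import Path


def load_tickets(incidents_dir: Path) -> list[dict]:
    """Load every JSON ticket in the directory, in filename order."""
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(incidents_dir.glob("*.json"))
    ]


def tag_counts(tickets: list[dict]) -> Counter:
    """Count how often each tag appears across all tickets."""
    return Counter(tag for ticket in tickets for tag in ticket.get("tags", []))


def severity_by_service(tickets: list[dict]) -> Counter:
    """Count incidents per (service, severity) pair."""
    return Counter(
        (ticket.get("service", "unknown"), ticket.get("severity", "unknown"))
        for ticket in tickets
    )
```

For example, `tag_counts(load_tickets(Path("incidents"))).most_common(5)` would list the five most frequent tags, which is a reasonable starting point for spotting recurring issues.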
Next Steps
- Review the Ticket Format documentation for detailed field descriptions
- Browse existing incidents in incidents/ to see real examples
- Set up monitoring to catch issues before users report them
- Establish incident response runbooks for common scenarios