Aurora’s incident investigation automatically analyzes production incidents from your observability platforms, performing root cause analysis (RCA) in the background and providing actionable insights.

How It Works

1. Alert Detection & Correlation

When an alert arrives from a connected platform (Grafana, PagerDuty, Datadog, Netdata, Dynatrace, Splunk, Jenkins, or CloudBees), Aurora:
  • Creates an incident record in the incidents table
  • Analyzes the alert metadata for correlation opportunities
  • Groups related alerts using multiple strategies:
    • Service matching: Alerts affecting the same service
    • Time-based clustering: Alerts within a 5-minute window
    • Semantic similarity: ML-based alert description matching
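The first two strategies can be illustrated with a short Python sketch. This is not Aurora's implementation: the `Alert` dataclass, the `group_alerts` function, the decision to combine service matching with the time window, and the choice to measure the 5-minute window from the most recent alert in a group are all assumptions made for the example. Semantic similarity is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # time-based clustering window from the text above


@dataclass
class Alert:  # hypothetical shape; real alert payloads differ per platform
    service: str
    fired_at: datetime


def group_alerts(alerts: list[Alert]) -> list[list[Alert]]:
    """Group alerts that share a service and fire within WINDOW of the previous one."""
    groups: list[list[Alert]] = []
    open_group: dict[str, list[Alert]] = {}  # most recent group per service
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        current = open_group.get(alert.service)
        if current and alert.fired_at - current[-1].fired_at <= WINDOW:
            current.append(alert)  # same service, still inside the window
        else:
            current = [alert]  # start a new group for this service
            groups.append(current)
            open_group[alert.service] = current
    return groups
```

Measuring the window from the latest alert lets a long-running burst stay in one group; a fixed window anchored at the first alert would be an equally plausible reading of the description.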

Multi-Platform Support

Ingest alerts from Grafana, PagerDuty, Datadog, Netdata, Dynatrace, Splunk, Jenkins, and CloudBees

Smart Correlation

Automatically group related alerts to reduce noise and identify incident blast radius

2. Background RCA Workflow

Once an incident is created, Aurora launches a background Celery task that runs the RCA investigation:
# server/chat/background/task.py
run_background_chat.delay(
    user_id=user_id,
    session_id=session_id,
    initial_message=rca_context,
    incident_id=incident_id,  # Links investigation to incident
    trigger_metadata=metadata,
    send_notifications=True
)
The RCA agent:
  • Analyzes alert metadata, service context, and historical patterns
  • Runs diagnostic commands using the kubectl_tool, cloud_tool, and observability integrations
  • Generates streaming thoughts that are saved incrementally to the incident_thoughts table
  • Produces diagnostic suggestions and fix suggestions stored in incident_suggestions
The investigation runs autonomously in the background using a LangGraph workflow powered by the Workflow class in server/chat/backend/agent/workflow.py. This ensures the RCA continues even if the user closes the browser.

3. Real-Time Streaming UI

As the RCA progresses, the incident pages display real-time investigation progress.
Incident List Page (/incidents):
  • Shows all active and analyzed incidents
  • Real-time updates via Server-Sent Events (SSE)
  • Status indicators: investigating, analyzed, resolved, merged
  • Displays correlated alert count and affected services
Incident Detail Page (/incidents/[id]):
  • Thoughts Panel: Streams AI reasoning as it investigates
    • Saved incrementally every 1 second or after sentence boundaries
    • Progressive display via polling (1-second interval)
    • Auto-opens during active investigation
  • Suggestions Tab: Shows diagnostic commands and fix suggestions
  • Chat Tab: Ask follow-up questions about the incident
  • Raw Alert Tab: View the original alert payload from source platform
// client/src/app/incidents/[id]/page.tsx
useEffect(() => {
  const pollIncident = async () => {
    const data = await incidentsService.getIncident(params.id as string);

    // Keep only thoughts that have not been displayed yet
    const unseenThoughts = (data.streamingThoughts || []).filter(
      thought => !seenThoughtIdsRef.current.has(thought.id)
    );
    if (unseenThoughts.length === 0) return;

    // Record the IDs so the next poll does not append the same thoughts again
    unseenThoughts.forEach(thought => seenThoughtIdsRef.current.add(thought.id));
    setThoughts(prevThoughts => [...prevThoughts, ...unseenThoughts]);
  };

  const interval = setInterval(pollIncident, 1000); // Poll every 1 second
  return () => clearInterval(interval);
}, [params.id, incident?.status]);
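On the server side, the "every 1 second or after sentence boundaries" rule for saving thoughts could look roughly like the following Python sketch. The `ThoughtBuffer` class, its `feed` method, and the sentence-boundary regex are hypothetical names chosen for illustration; the source does not show how Aurora buffers thoughts.

```python
import re
from typing import Optional

FLUSH_INTERVAL_SECONDS = 1.0  # "every 1 second" from the description above
SENTENCE_END = re.compile(r"[.!?]\s*$")  # assumed definition of a sentence boundary


class ThoughtBuffer:
    """Accumulates streamed tokens and decides when a chunk should be persisted."""

    def __init__(self, started_at: float) -> None:
        self._parts: list[str] = []
        self._last_flush = started_at

    def feed(self, token: str, now: float) -> Optional[str]:
        """Add a token; return the buffered text if it is time to save it, else None."""
        self._parts.append(token)
        text = "".join(self._parts)
        interval_elapsed = now - self._last_flush >= FLUSH_INTERVAL_SECONDS
        if interval_elapsed or SENTENCE_END.search(text):
            self._parts.clear()
            self._last_flush = now
            return text
        return None
```

A caller would write each returned chunk to the incident_thoughts table, which is what the 1-second poll on the client then picks up.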

4. Investigation Status Lifecycle

Incidents progress through these states:
Status | Description
investigating | RCA is actively running (Aurora analyzing in background)
analyzed | RCA completed, waiting for user action
resolved | User marked incident as resolved (triggers postmortem generation)
merged | Incident was merged into another related incident
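The lifecycle can be modeled as a small state machine. In the sketch below, the four status values come from the list above, but the transition map is an assumption inferred from the descriptions (for example, that resolved and merged are terminal); the source does not spell out which transitions Aurora permits.

```python
from enum import Enum


class IncidentStatus(str, Enum):
    INVESTIGATING = "investigating"
    ANALYZED = "analyzed"
    RESOLVED = "resolved"
    MERGED = "merged"


# Assumed transitions, inferred from the status descriptions; not confirmed by the source.
ALLOWED_TRANSITIONS: dict[IncidentStatus, set[IncidentStatus]] = {
    IncidentStatus.INVESTIGATING: {IncidentStatus.ANALYZED, IncidentStatus.MERGED},
    IncidentStatus.ANALYZED: {IncidentStatus.RESOLVED, IncidentStatus.MERGED},
    IncidentStatus.RESOLVED: set(),
    IncidentStatus.MERGED: set(),
}


def can_transition(current: IncidentStatus, target: IncidentStatus) -> bool:
    """Return True if moving from `current` to `target` is allowed by the map above."""
    return target in ALLOWED_TRANSITIONS[current]
```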

User Workflows

Viewing Live Investigation

  1. Navigate to the Incidents page (/incidents)
  2. Click on an incident with status investigating
  3. The Thoughts Panel auto-opens on the right side
  4. Watch AI reasoning stream in real-time as Aurora:
    • Analyzes alert context
    • Runs diagnostic commands
    • Identifies root cause patterns
    • Generates suggestions
Figure: Live incident investigation with streaming thoughts and diagnostic suggestions

Interacting with Suggestions

Aurora generates two types of suggestions:
Type: diagnostic
Safe, read-only commands to gather more information:
  • kubectl get pods -n production
  • gcloud logging read --limit=50 --filter="severity=ERROR"
  • View service logs, metrics, or configuration
Actions:
  • Click “Copy” to copy command to clipboard
  • Click “Execute” to run in Agent mode (if enabled)
Type: fix
Code changes to resolve the incident:
  • Configuration updates
  • Bug fixes
  • Dependency version changes
Fields:
  • filePath: File to modify
  • originalContent: Current code
  • suggestedContent: Proposed fix
  • userEditedContent: User’s customized version
Actions:
  1. Review the suggested code change
  2. Edit the fix if needed using the inline editor
  3. Click “Apply Fix” to create a GitHub branch and pull request
  4. The PR is created with:
    • Branch: aurora/fix-incident-{incident_id}-{timestamp}
    • Commit message: Incident context + fix description
    • Link back to incident in Aurora
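As a rough illustration of the "Apply Fix" step, the Python sketch below builds a branch name in the documented pattern and picks which content to commit. The helper names are hypothetical, the timestamp format is an assumption (the source only shows a `{timestamp}` placeholder), and preferring `userEditedContent` over `suggestedContent` is inferred from the field descriptions rather than confirmed.

```python
from datetime import datetime, timezone
from typing import Optional


def fix_branch_name(incident_id: str, now: Optional[datetime] = None) -> str:
    """Build a branch name following aurora/fix-incident-{incident_id}-{timestamp}."""
    now = now or datetime.now(timezone.utc)
    timestamp = now.strftime("%Y%m%d%H%M%S")  # assumed format; not specified in the source
    return f"aurora/fix-incident-{incident_id}-{timestamp}"


def content_to_commit(suggestion: dict) -> str:
    """Prefer the user's edited version of the fix, falling back to the suggestion."""
    return suggestion.get("userEditedContent") or suggestion["suggestedContent"]
```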

Follow-Up Chat

Ask questions about an incident using the Chat tab:
// POST /api/incidents/{incident_id}/chat
{
  "question": "What was the CPU usage trend before the incident?",
  "mode": "ask"  // or "agent" for execution capability
}
The chat:
  • Loads incident context (alert details, RCA summary, investigation thoughts)
  • Runs as a separate background chat session (not a new RCA)
  • Supports both ask (read-only) and agent (execution) modes
  • Is saved in the chat_sessions table with an incident_id foreign key
// server/routes/incidents_routes.py
if existing_session_id:
    # Continue existing chat
    session_id = existing_session_id
    full_message = question
else:
    # Build context for new chat
    context_prefix = f"""<context>
<incident>
Title: {alert_title}
Severity: {severity}
Summary: {summary}
</incident>
<investigation_progress>
{investigation_thoughts}
</investigation_progress>
</context>

<user_message>
{question}
</user_message>"""
    
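    session_id = create_background_chat_session(
        user_id=user_id,
        incident_id=incident_id,  # Link to incident
        title=f"Incident: {question[:50]}..."
    )
    # The context prefix already embeds the question, so it is sent as the full message
    full_message = context_prefix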

Merging Incidents

When you discover that two incidents are related:
  1. Navigate to the incident that should be merged
  2. Click “Merge Alert” in the UI
  3. Select the target incident to merge into
  4. Aurora will:
    • Stop the source incident’s RCA (via Celery task revocation)
    • Copy the alert to target incident’s incident_alerts table
    • Transfer RCA context to target investigation
    • Mark source incident as merged
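The four merge steps can be sketched in Python over plain dictionaries. Everything here is illustrative: the `merge_incident` function and the `rca_task_id`, `alerts`, `rca_context`, and `merged_into` keys are hypothetical stand-ins for Aurora's real tables and fields. The revocation is injected as a callable so the sketch runs without a Celery worker.

```python
from typing import Callable


def merge_incident(source: dict, target: dict, revoke_task: Callable[[str], None]) -> None:
    """Merge `source` into `target`, mirroring the four steps listed above."""
    # 1. Stop the source RCA. With Celery this is typically
    #    app.control.revoke(task_id, terminate=True); injected here to stay testable.
    if source.get("rca_task_id"):
        revoke_task(source["rca_task_id"])
    # 2. Copy the source alerts onto the target incident.
    target["alerts"].extend(source["alerts"])
    # 3. Hand the source investigation context to the target investigation.
    target["rca_context"].extend(source["rca_context"])
    # 4. Mark the source as merged and remember where it went.
    source["status"] = "merged"
    source["merged_into"] = target["id"]
```

Revoking before copying means the source investigation cannot write new thoughts while its context is being transferred.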

Citations & Command Traceability

All diagnostic commands executed during RCA are tracked:
  • Stored in incident_citations table with citation_key (numeric reference)
  • Includes: tool name, command, output, execution timestamp
  • Referenced inline in thoughts using [1], [2] notation
  • Click citation badge to view full command output in modal
# Example citation record
{
  "key": "1",
  "toolName": "kubectl_tool",
  "command": "kubectl get pods -n production",
  "output": "NAME                     READY   STATUS\napi-5d4c8b7f9-x7z2p      1/1     Running",
  "executedAt": "2026-03-03T10:30:15Z"
}
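To show how inline references such as [1] and [2] map back to citation records, here is a minimal Python sketch. The `resolve_citations` helper is hypothetical; it assumes citation records carry a string `key` as in the example record above and ignores references that have no matching record.

```python
import re

CITATION_REF = re.compile(r"\[(\d+)\]")  # matches [1], [2], ...


def resolve_citations(thought: str, citations: list[dict]) -> list[dict]:
    """Return the citation records referenced in `thought`, in order of first appearance."""
    by_key = {c["key"]: c for c in citations}
    seen: set[str] = set()
    resolved = []
    for key in CITATION_REF.findall(thought):
        if key in by_key and key not in seen:
            seen.add(key)
            resolved.append(by_key[key])
    return resolved
```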

API Reference

Get All Incidents

GET /api/incidents
Returns the list of incidents for the current user, excluding incidents with status merged.

Get Incident Details

GET /api/incidents/{incident_id}
Returns full incident data including:
  • Alert metadata and raw payload
  • Streaming thoughts
  • Suggestions (diagnostic + fix)
  • Citations
  • Chat sessions
  • Correlated alerts

Update Incident Status

PATCH /api/incidents/{incident_id}
{
  "status": "resolved",
  "auroraStatus": "complete",
  "summary": "Root cause: Database connection pool exhausted"
}

Chat with Incident

POST /api/incidents/{incident_id}/chat?session_id={optional}
{
  "question": "What was the database connection count?",
  "mode": "ask"
}
Returns 202 Accepted with a session_id that you can use to poll for the response.
See AI Chat Interface for details on WebSocket streaming and session management.
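A client call to the chat endpoint might be assembled as in the Python sketch below, which uses only the standard library and builds the request without sending it. The `build_chat_request` helper and the base URL are hypothetical, and authentication is omitted because the source does not describe it.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_chat_request(base_url: str, incident_id: str, question: str,
                       mode: str = "ask", session_id: Optional[str] = None) -> Request:
    """Build (but do not send) the POST request for the incident chat endpoint."""
    if mode not in ("ask", "agent"):
        raise ValueError(f"unsupported mode: {mode}")
    url = f"{base_url}/api/incidents/{incident_id}/chat"
    if session_id:
        # Passing session_id continues an existing chat instead of starting a new one
        url += "?" + urlencode({"session_id": session_id})
    body = json.dumps({"question": question, "mode": mode}).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})
```

Passing the result to `urllib.request.urlopen` would send it; per the description above, a successful call answers with 202 Accepted and a session_id to poll.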

Observability Tools

Connect Grafana, Datadog, Netdata, and more to ingest alerts

Cloud Integrations

Execute diagnostic commands across GCP, AWS, and Azure
