AI-Powered Incident Investigation

Aurora’s incident investigation automatically analyzes production incidents from your observability platforms, performing root cause analysis (RCA) in the background and providing actionable insights.

How It Works

1. Alert Detection & Correlation

When an alert arrives from a connected platform (Grafana, PagerDuty, Datadog, Netdata, Dynatrace, Splunk, Jenkins, or CloudBees), Aurora:

1. Creates an incident record in the incidents table
2. Analyzes the alert metadata for correlation opportunities
3. Groups related alerts using multiple strategies:
   - Service matching: Alerts affecting the same service
   - Time-based clustering: Alerts within a 5-minute window
   - Semantic similarity: ML-based alert description matching
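
The first two strategies can be illustrated with a minimal sketch. It is not Aurora's implementation: the `Alert` shape and the `group_alerts` helper are hypothetical, and semantic similarity is left out because the source does not describe the model behind it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

WINDOW = timedelta(minutes=5)  # the 5-minute clustering window described above


@dataclass
class Alert:
    # Hypothetical alert shape, for illustration only
    alert_id: str
    service: str
    fired_at: datetime


def group_alerts(alerts: List[Alert]) -> List[List[Alert]]:
    """Group alerts that share a service and fire within WINDOW of the group's latest alert."""
    groups: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        for group in groups:
            last = group[-1]
            if last.service == alert.service and alert.fired_at - last.fired_at <= WINDOW:
                group.append(alert)
                break
        else:
            # No existing group matched, so this alert starts a new one
            groups.append([alert])
    return groups


t0 = datetime(2026, 3, 3, 10, 30)
alerts = [
    Alert("a1", "api", t0),
    Alert("a2", "api", t0 + timedelta(minutes=3)),
    Alert("a3", "db", t0 + timedelta(minutes=4)),
    Alert("a4", "api", t0 + timedelta(minutes=20)),
]
grouped = [[a.alert_id for a in g] for g in group_alerts(alerts)]
print(grouped)
```

In this sketch the window slides with the latest alert in each group, so a slow trickle of alerts on one service stays in a single group. A fixed window anchored at the first alert would be an equally plausible reading of "within a 5-minute window".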

Multi-Platform Support

Ingest alerts from Grafana, PagerDuty, Datadog, Netdata, Dynatrace, Splunk, Jenkins, and CloudBees

Smart Correlation

Automatically group related alerts to reduce noise and identify incident blast radius

2. Background RCA Workflow

Once an incident is created, Aurora launches a background Celery task that runs the RCA investigation:

# server/chat/background/task.py
run_background_chat.delay(
    user_id=user_id,
    session_id=session_id,
    initial_message=rca_context,
    incident_id=incident_id,  # Links investigation to incident
    trigger_metadata=metadata,
    send_notifications=True
)

The RCA agent:

Analyzes alert metadata, service context, and historical patterns
Runs diagnostic commands using the kubectl_tool, cloud_tool, and observability integrations
Generates streaming thoughts that are saved incrementally to the incident_thoughts table
Produces diagnostic suggestions and fix suggestions stored in incident_suggestions

The investigation runs autonomously in the background using a LangGraph workflow powered by the Workflow class in server/chat/backend/agent/workflow.py. This ensures the RCA continues even if the user closes the browser.

3. Real-Time Streaming UI

As the RCA progresses, the UI displays investigation progress in real time.

Incident List Page (/incidents):

Shows all active and analyzed incidents
Real-time updates via Server-Sent Events (SSE)
Status indicators: investigating, analyzed, resolved, merged
Displays correlated alert count and affected services

Incident Detail Page (/incidents/[id]):

Thoughts Panel: Streams AI reasoning as it investigates
- Saved incrementally every 1 second or after sentence boundaries
- Progressive display via polling (1-second interval)
- Auto-opens during active investigation
Suggestions Tab: Shows diagnostic commands and fix suggestions
Chat Tab: Ask follow-up questions about the incident
Raw Alert Tab: View the original alert payload from source platform

// client/src/app/incidents/[id]/page.tsx
useEffect(() => {
  const pollIncident = async () => {
    const data = await incidentsService.getIncident(params.id as string);

    // Keep only thoughts that haven't been seen yet
    const unseenThoughts = (data.streamingThoughts || []).filter(
      thought => !seenThoughtIdsRef.current.has(thought.id)
    );
    if (unseenThoughts.length === 0) return;

    // Record the IDs so the next poll does not append the same thoughts again
    unseenThoughts.forEach(thought => seenThoughtIdsRef.current.add(thought.id));
    setThoughts(prevThoughts => [...prevThoughts, ...unseenThoughts]);
  };

  const interval = setInterval(pollIncident, 1000); // Poll every 1 second
  return () => clearInterval(interval);
}, [params.id, incident?.status]);
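
On the server side, the "every 1 second or after sentence boundaries" save policy could look roughly like the following. This is a sketch under stated assumptions, not Aurora's code: `ThoughtBuffer`, its `save` callback, and the injectable clock are all hypothetical names introduced for illustration.

```python
import time

SENTENCE_ENDINGS = (".", "!", "?")
FLUSH_INTERVAL_SECONDS = 1.0


class ThoughtBuffer:
    """Hypothetical buffer that decides when streamed tokens are persisted."""

    def __init__(self, save, clock=time.monotonic):
        self._save = save      # callback that persists one chunk of text
        self._clock = clock    # injectable clock, so the policy can be tested
        self._pending = ""
        self._last_flush = clock()

    def add(self, token: str) -> None:
        self._pending += token
        ends_sentence = self._pending.rstrip().endswith(SENTENCE_ENDINGS)
        interval_elapsed = self._clock() - self._last_flush >= FLUSH_INTERVAL_SECONDS
        if ends_sentence or interval_elapsed:
            self.flush()

    def flush(self) -> None:
        if self._pending.strip():
            self._save(self._pending)
        self._pending = ""
        self._last_flush = self._clock()


saved = []
now = [0.0]  # fake clock value, advanced manually below
buf = ThoughtBuffer(saved.append, clock=lambda: now[0])
buf.add("Checking pod status")  # no sentence end, no time elapsed: stays buffered
buf.add(".")                    # sentence boundary: flushed
buf.add(" Pods look")
now[0] = 1.5
buf.add(" healthy")             # more than 1 second since the last flush: flushed
print(saved)
```

Flushing on whichever condition comes first keeps the polling UI close to the live stream without writing a row per token.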

4. Investigation Status Lifecycle

Incidents progress through these states:

Status	Description
`investigating`	RCA is actively running (Aurora analyzing in background)
`analyzed`	RCA completed, waiting for user action
`resolved`	User marked incident as resolved (triggers postmortem generation)
`merged`	Incident was merged into another related incident
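
One way to reason about the lifecycle is as a small transition graph. The four status values come from the table above; the allowed transitions below are an assumption made for illustration and may not match the rules Aurora actually enforces.

```python
from enum import Enum


class IncidentStatus(str, Enum):
    INVESTIGATING = "investigating"
    ANALYZED = "analyzed"
    RESOLVED = "resolved"
    MERGED = "merged"


# Assumed transition graph; not taken from Aurora's source
ALLOWED = {
    IncidentStatus.INVESTIGATING: {IncidentStatus.ANALYZED, IncidentStatus.MERGED},
    IncidentStatus.ANALYZED: {IncidentStatus.RESOLVED, IncidentStatus.MERGED},
    IncidentStatus.RESOLVED: set(),
    IncidentStatus.MERGED: set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if moving from current to target is allowed in the assumed graph."""
    return IncidentStatus(target) in ALLOWED[IncidentStatus(current)]


print(can_transition("investigating", "analyzed"))
print(can_transition("merged", "resolved"))
```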

User Workflows

Viewing Live Investigation

1. Navigate to the Incidents page
2. Click an incident with status investigating
3. The Thoughts Panel auto-opens on the right side
4. Watch the AI reasoning stream in real time as Aurora:
   - Analyzes alert context
   - Runs diagnostic commands
   - Identifies root cause patterns
   - Generates suggestions

Figure: Live incident investigation with streaming thoughts and diagnostic suggestions

Interacting with Suggestions

Aurora generates two types of suggestions:

Diagnostic Suggestions

Type: diagnostic

Safe, read-only commands to gather more information:

kubectl get pods -n production
gcloud logging read --limit=50 --filter="severity=ERROR"
View service logs, metrics, or configuration

Actions:

Click “Copy” to copy command to clipboard
Click “Execute” to run in Agent mode (if enabled)

Fix Suggestions

Type: fix

Code changes to resolve the incident:

Configuration updates
Bug fixes
Dependency version changes

Fields:

filePath: File to modify
originalContent: Current code
suggestedContent: Proposed fix
userEditedContent: User’s customized version

Actions:

1. Review the suggested code change
2. Edit the fix if needed using the inline editor
3. Click “Apply Fix” to create a GitHub branch and pull request
4. The PR is created with:
   - Branch: aurora/fix-incident-{incident_id}-{timestamp}
   - Commit message: Incident context + fix description
   - Link back to incident in Aurora
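
As an illustration of the branch naming pattern, here is a minimal sketch. The `fix_branch_name` helper is hypothetical, and the timestamp format is an assumption because the source does not specify one.

```python
from datetime import datetime, timezone
from typing import Optional


def fix_branch_name(incident_id: str, now: Optional[datetime] = None) -> str:
    """Build a branch name following aurora/fix-incident-{incident_id}-{timestamp}."""
    now = now or datetime.now(timezone.utc)
    timestamp = now.strftime("%Y%m%d%H%M%S")  # assumed format; not confirmed by the source
    return f"aurora/fix-incident-{incident_id}-{timestamp}"


name = fix_branch_name("42", datetime(2026, 3, 3, 10, 30, 15, tzinfo=timezone.utc))
print(name)
```

Including a timestamp means that applying a second fix for the same incident produces a distinct branch instead of colliding with the first.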

Follow-Up Chat

Ask questions about an incident using the Chat tab:

// POST /api/incidents/{incident_id}/chat
{
  "question": "What was the CPU usage trend before the incident?",
  "mode": "ask"  // or "agent" for execution capability
}

The chat:

Loads incident context (alert details, RCA summary, investigation thoughts)
Runs as a separate background chat session (not a new RCA)
Supports both ask (read-only) and agent (execution) modes
Is saved in the chat_sessions table with an incident_id foreign key

// server/routes/incidents_routes.py
if existing_session_id:
    # Continue existing chat
    session_id = existing_session_id
    full_message = question
else:
    # Build context for new chat
    context_prefix = f"""<context>
<incident>
Title: {alert_title}
Severity: {severity}
Summary: {summary}
</incident>
<investigation_progress>
{investigation_thoughts}
</investigation_progress>
</context>

<user_message>
{question}
</user_message>"""
    
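    # The context block already wraps the question, so it becomes the first message
    full_message = context_prefix

    session_id = create_background_chat_session(
        user_id=user_id,
        incident_id=incident_id,  # Link to incident
        title=f"Incident: {question[:50]}..."
    )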

Merging Related Incidents

When you discover that two incidents are related:

1. Navigate to the incident that should be merged
2. Click “Merge Alert” in the UI
3. Select the target incident to merge into
4. Aurora will:
   - Stop the source incident’s RCA (via Celery task revocation)
   - Copy the alert to target incident’s incident_alerts table
   - Transfer RCA context to target investigation
   - Mark source incident as merged
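
The four merge actions can be sketched against in-memory data. This is an illustration only: the dictionary fields, `merge_incident`, and the `revoke_task` callback are hypothetical stand-ins for Aurora's database tables and task handling.

```python
def merge_incident(incidents: dict, source_id: str, target_id: str, revoke_task) -> None:
    """Merge the source incident into the target, mirroring the four steps above."""
    if source_id == target_id:
        raise ValueError("an incident cannot be merged into itself")
    source, target = incidents[source_id], incidents[target_id]

    # 1. Stop the source RCA. With Celery, revoke_task would typically wrap
    #    app.control.revoke(task_id, terminate=True).
    if source.get("task_id"):
        revoke_task(source["task_id"])

    # 2. Copy the source alerts to the target incident
    target["alerts"].extend(source["alerts"])

    # 3. Transfer what the source investigation has learned so far
    target["rca_context"].extend(source["rca_context"])

    # 4. Mark the source as merged
    source["status"] = "merged"


incidents = {
    "inc-1": {"status": "investigating", "task_id": "task-1",
              "alerts": ["cpu-high"], "rca_context": ["CPU saturated on api pods"]},
    "inc-2": {"status": "investigating", "task_id": "task-2",
              "alerts": ["latency-high"], "rca_context": []},
}
revoked = []
merge_incident(incidents, "inc-1", "inc-2", revoked.append)
print(incidents["inc-2"]["alerts"], incidents["inc-1"]["status"], revoked)
```

Revoking first avoids the source investigation writing new thoughts after its context has already been handed to the target.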

Citations & Command Traceability

All diagnostic commands executed during RCA are tracked:

Stored in incident_citations table with citation_key (numeric reference)
Includes: tool name, command, output, execution timestamp
Referenced inline in thoughts using [1], [2] notation
Click citation badge to view full command output in modal

# Example citation record
{
  "key": "1",
  "toolName": "kubectl_tool",
  "command": "kubectl get pods -n production",
  "output": "NAME                     READY   STATUS\napi-5d4c8b7f9-x7z2p      1/1     Running",
  "executedAt": "2026-03-03T10:30:15Z"
}
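
To show how the inline [1], [2] notation maps back to citation records, here is a minimal sketch. The `resolve_citations` helper is hypothetical; the record fields follow the example above.

```python
import re
from typing import Dict, List

CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def resolve_citations(thought: str, citations: List[Dict]) -> List[Dict]:
    """Return the citation records referenced in a thought, in order of first appearance."""
    by_key = {c["key"]: c for c in citations}
    seen: List[str] = []
    for key in CITATION_PATTERN.findall(thought):
        # Skip repeats and references that have no matching record
        if key in by_key and key not in seen:
            seen.append(key)
    return [by_key[k] for k in seen]


citations = [
    {"key": "1", "toolName": "kubectl_tool", "command": "kubectl get pods -n production"},
    {"key": "2", "toolName": "cloud_tool",
     "command": 'gcloud logging read --limit=50 --filter="severity=ERROR"'},
]
thought = "All pods are running [1], but the error log shows timeouts [2][1]. See also [3]."
resolved_keys = [c["key"] for c in resolve_citations(thought, citations)]
print(resolved_keys)
```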

API Reference

Get All Incidents

GET /api/incidents

Returns the list of incidents for the current user, excluding incidents with status merged.

Get Incident Details

GET /api/incidents/{incident_id}

Returns full incident data including:

Alert metadata and raw payload
Streaming thoughts
Suggestions (diagnostic + fix)
Citations
Chat sessions
Correlated alerts

Update Incident Status

PATCH /api/incidents/{incident_id}

{
  "status": "resolved",
  "auroraStatus": "complete",
  "summary": "Root cause: Database connection pool exhausted"
}

Chat with Incident

POST /api/incidents/{incident_id}/chat?session_id={optional}

{
  "question": "What was the database connection count?",
  "mode": "ask"
}

Returns 202 Accepted with a session_id that you can poll for the response.
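
A client could assemble this request as in the sketch below, which only builds the request and does not send it. The base URL and the `build_chat_request` helper are hypothetical, and authentication is omitted because the source does not describe it.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_chat_request(base_url: str, incident_id: str, question: str,
                       mode: str = "ask", session_id: Optional[str] = None) -> Request:
    """Build the POST request for the incident chat endpoint without sending it."""
    if mode not in ("ask", "agent"):
        raise ValueError("mode must be 'ask' or 'agent'")
    url = f"{base_url.rstrip('/')}/api/incidents/{quote(incident_id, safe='')}/chat"
    if session_id:
        # Passing session_id continues an existing chat instead of starting a new one
        url += "?" + urlencode({"session_id": session_id})
    body = json.dumps({"question": question, "mode": mode}).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


req = build_chat_request(
    "https://aurora.example.com",  # hypothetical host
    "inc-123",
    "What was the database connection count?",
    session_id="abc",
)
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen` would send it. The 202 response body is then expected to carry the session_id described above.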

See AI Chat Interface for details on WebSocket streaming and session management.

Related Features

Observability Tools

Connect Grafana, Datadog, Netdata, and more to ingest alerts

Cloud Integrations

Execute diagnostic commands across GCP, AWS, and Azure
