This guide covers the monitoring and metrics system for tracking encryption operations, analyzing failure rates, and maintaining system health.

Overview

The KMS includes a comprehensive monitoring service that tracks:
  • Key generation events and success rates
  • Encryption operations (client-side testing)
  • Decryption operations and failures
  • Performance metrics (operation duration)
  • Error patterns and failure reasons

Metrics Collection

The EncryptionMonitoringService automatically records metrics for all encryption operations:
// From: src/encryption/encryption-monitoring.service.ts:13-47
/**
 * Record metrics for encryption operations
 */
async recordMetric(
  operation: 'encrypt' | 'decrypt' | 'generate',
  status: 'success' | 'failure',
  {
    keyId = null,
    deviceId = null,
    errorReason = null,
    duration = 0,
  }: {
    keyId?: string | null;
    deviceId?: string | null;
    errorReason?: string | null;
    duration?: number;
  } = {},
): Promise<void> {
  try {
    await this.prismaService.encryptionMetric.create({
      data: {
        keyId,
        deviceId,
        operation,
        status,
        errorReason,
        duration,
        timestamp: new Date(),
      },
    });
  } catch (error) {
    // Log but don't throw - metrics should not break main functionality
    this.logger.error(
      `Failed to record encryption metric: ${error.message}`,
      error.stack,
    );
  }
}

Automatic Metric Recording

Metrics are automatically recorded in the encryption resolver:
// From: src/encryption/encryption.resolver.ts:25-56
@Mutation(() => EncryptionKeyOutput)
async generateClientEncryptionKey(
  @Args('input') input: ClientIdentityInput,
): Promise<EncryptionKeyOutput> {
  const startTime = Date.now();
  const { deviceId, appVersion } = input;

  try {
    const result = await this.encryptionService.generateClientEncryptionKey(
      deviceId,
      appVersion,
    );

    // Record metric
    await this.monitoringService.recordMetric('generate', 'success', {
      deviceId,
      keyId: result.keyId,
      duration: Date.now() - startTime,
    });

    return result;
  } catch (error) {
    // Record failure metric
    await this.monitoringService.recordMetric('generate', 'failure', {
      deviceId,
      errorReason: error.message,
      duration: Date.now() - startTime,
    });

    throw error;
  }
}
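
The same timing-and-recording pattern applies to any operation, so it can be factored into a helper. The following is a minimal sketch, not code from the KMS: withMetric, MetricRecorder, and MetricDetails are hypothetical names, and only the recordMetric call shape mirrors the service shown earlier.

```typescript
// Hypothetical types for this sketch; only recordMetric's shape follows the guide.
type Operation = 'encrypt' | 'decrypt' | 'generate';
type Status = 'success' | 'failure';

interface MetricDetails {
  keyId?: string | null;
  deviceId?: string | null;
  errorReason?: string | null;
  duration?: number;
}

interface MetricRecorder {
  recordMetric(
    operation: Operation,
    status: Status,
    details?: MetricDetails,
  ): Promise<void>;
}

// Runs `work`, records a success or failure metric with the elapsed time,
// and rethrows the original error so callers see unchanged behavior.
async function withMetric<T>(
  recorder: MetricRecorder,
  operation: Operation,
  details: MetricDetails,
  work: () => Promise<T>,
  keyIdOf: (result: T) => string | null = () => null,
): Promise<T> {
  const startTime = Date.now();
  try {
    const result = await work();
    await recorder.recordMetric(operation, 'success', {
      ...details,
      keyId: keyIdOf(result) ?? details.keyId ?? null,
      duration: Date.now() - startTime,
    });
    return result;
  } catch (error) {
    const errorReason = error instanceof Error ? error.message : String(error);
    await recorder.recordMetric(operation, 'failure', {
      ...details,
      errorReason,
      duration: Date.now() - startTime,
    });
    throw error;
  }
}
```

Passing the recorder as a parameter keeps the helper testable with a fake recorder and avoids coupling it to a particular framework.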

Metrics Summary Query

GraphQL Query

Retrieve a comprehensive metrics summary for a specific timeframe:
query GetMetrics {
  getEncryptionMetricsSummary(timeframeHours: 24)
}

Query Parameters

  • timeframeHours (optional): Number of hours to analyze (default: 24)

Response Format

The query returns a JSON string that parses to the structure below. Note that count values are serialized as strings:
{
  "metrics": [
    {
      "operation": "generate",
      "status": "success",
      "count": "150",
      "avg_duration": 45.2
    },
    {
      "operation": "decrypt",
      "status": "success",
      "count": "523",
      "avg_duration": 12.8
    },
    {
      "operation": "decrypt",
      "status": "failure",
      "count": "7",
      "avg_duration": 8.3
    }
  ],
  "failureRates": [
    {
      "operation": "generate",
      "failure_rate": 1.2
    },
    {
      "operation": "decrypt",
      "failure_rate": 1.3
    }
  ],
  "topFailures": [
    {
      "operation": "decrypt",
      "reason": "Encryption key has expired",
      "count": "5"
    },
    {
      "operation": "decrypt",
      "reason": "Encryption key not found",
      "count": "2"
    }
  ],
  "timeframeHours": 24
}
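
Because the payload is a string and its count fields are themselves strings, a client will usually want a typed parsing step. This is a minimal sketch; the interface names and parseMetricsSummary are hypothetical, and the field shapes are inferred from the sample response above rather than from a published schema.

```typescript
// Shapes inferred from the sample response; treat them as assumptions.
interface RawSummary {
  metrics: { operation: string; status: string; count: string; avg_duration: number }[];
  failureRates: { operation: string; failure_rate: number }[];
  topFailures: { operation: string; reason: string; count: string }[];
  timeframeHours: number;
}

interface MetricsSummary {
  metrics: { operation: string; status: string; count: number; avgDuration: number }[];
  failureRates: { operation: string; failureRate: number }[];
  topFailures: { operation: string; reason: string; count: number }[];
  timeframeHours: number;
}

// Parses the JSON string and converts numeric fields to real numbers.
function parseMetricsSummary(json: string): MetricsSummary {
  const raw = JSON.parse(json) as RawSummary;
  return {
    metrics: raw.metrics.map((m) => ({
      operation: m.operation,
      status: m.status,
      count: Number(m.count),
      avgDuration: Number(m.avg_duration),
    })),
    failureRates: raw.failureRates.map((f) => ({
      operation: f.operation,
      failureRate: Number(f.failure_rate),
    })),
    topFailures: raw.topFailures.map((t) => ({
      operation: t.operation,
      reason: t.reason,
      count: Number(t.count),
    })),
    timeframeHours: raw.timeframeHours,
  };
}
```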

Metrics Summary Implementation

// From: src/encryption/encryption-monitoring.service.ts:52-107
async generateMetricsSummary(timeframeHours: number = 24): Promise<any> {
  const timeThreshold = new Date();
  timeThreshold.setHours(timeThreshold.getHours() - timeframeHours);

  try {
    // Get total counts by operation and status
    const metrics = await this.prismaService.$queryRaw`
      SELECT 
        operation, 
        status, 
        COUNT(*) as count,
        AVG(duration) as avg_duration
      FROM "EncryptionMetric"
      WHERE timestamp >= ${timeThreshold}
      GROUP BY operation, status
    `;

    // Get failure rates
    const failureRates = await this.prismaService.$queryRaw`
      SELECT 
        operation,
        SUM(CASE WHEN status = 'failure' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as failure_rate
      FROM "EncryptionMetric"
      WHERE timestamp >= ${timeThreshold}
      GROUP BY operation
    `;

    // Top failure reasons
    const topFailures = await this.prismaService.$queryRaw<
      { operation: string; reason: string; count: bigint }[]
    >`
      SELECT 
        operation,
        error_reason AS reason,  -- column name per @map("error_reason") in the schema
        COUNT(*) AS count
      FROM "EncryptionMetric"
      WHERE status = 'failure' AND timestamp >= ${timeThreshold}
      GROUP BY operation, error_reason
      ORDER BY count DESC
      LIMIT 10;
    `;

    return {
      metrics,
      failureRates,
      topFailures,
      timeframeHours,
    };
  } catch (error) {
    this.logger.error(
      `Failed to generate metrics summary: ${error.message}`,
      error.stack,
    );
    throw new Error('Failed to generate encryption metrics summary');
  }
}
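
The failure-rate query computes failures * 100 / total for each operation. As a cross-check, the same figure can be derived client-side from the per-status counts. The sketch below assumes counts have already been converted to numbers; CountRow and failureRatesFromCounts are hypothetical names.

```typescript
interface CountRow {
  operation: string;
  status: string; // 'success' | 'failure'
  count: number;
}

// Mirrors the SQL expression:
// SUM(CASE WHEN status = 'failure' THEN 1 ELSE 0 END) * 100.0 / COUNT(*)
function failureRatesFromCounts(rows: CountRow[]): Record<string, number> {
  const failures: Record<string, number | undefined> = {};
  const totals: Record<string, number | undefined> = {};

  for (const row of rows) {
    totals[row.operation] = (totals[row.operation] ?? 0) + row.count;
    if (row.status === 'failure') {
      failures[row.operation] = (failures[row.operation] ?? 0) + row.count;
    }
  }

  const rates: Record<string, number> = {};
  for (const operation of Object.keys(totals)) {
    const total = totals[operation] ?? 0;
    rates[operation] = total === 0 ? 0 : ((failures[operation] ?? 0) * 100) / total;
  }
  return rates;
}
```

With 523 successful and 7 failed decryptions, this gives 7 * 100 / 530, or roughly 1.32%.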

Finding Problematic Keys

Identify keys with high failure rates that may need rotation:
// From: src/encryption/encryption-monitoring.service.ts:112-141
async findProblematicKeys(
  failureThresholdPercent: number = 10,
): Promise<string[]> {
  try {
    const problematicKeys: any = await this.prismaService.$queryRaw`
      WITH key_stats AS (
        SELECT 
          key_id,
          SUM(CASE WHEN status = 'failure' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) as failure_rate,
          COUNT(*) as total_operations
        FROM "EncryptionMetric"
        WHERE key_id IS NOT NULL AND operation = 'decrypt'
        GROUP BY key_id
        HAVING COUNT(*) >= 5  -- Minimum number of operations to consider
      )
      SELECT key_id
      FROM key_stats
      WHERE failure_rate >= ${failureThresholdPercent}
      ORDER BY failure_rate DESC, total_operations DESC
    `;

    return problematicKeys.map((k) => k.key_id);
  } catch (error) {
    this.logger.error(
      `Failed to find problematic keys: ${error.message}`,
      error.stack,
    );
    return [];
  }
}
This identifies keys where:
  • At least 5 decryption operations have been recorded
  • The failure rate meets or exceeds the threshold (default: 10%)
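
The selection rule can also be expressed in plain TypeScript, which is convenient for unit-testing the thresholds without a database. This is an illustrative equivalent of the query's logic, not part of the service; DecryptEvent, KeyStat, and problematicKeysInMemory are hypothetical names.

```typescript
interface DecryptEvent {
  keyId: string | null;
  status: 'success' | 'failure';
}

interface KeyStat {
  keyId: string;
  failures: number;
  total: number;
}

// Returns key IDs with at least `minOperations` decrypt events whose failure
// rate is >= the threshold, ordered by failure rate, then by volume.
function problematicKeysInMemory(
  events: DecryptEvent[],
  failureThresholdPercent: number = 10,
  minOperations: number = 5,
): string[] {
  const byKey: Record<string, KeyStat | undefined> = {};
  const stats: KeyStat[] = [];

  for (const event of events) {
    if (event.keyId === null) continue; // same as "key_id IS NOT NULL"
    let stat = byKey[event.keyId];
    if (!stat) {
      stat = { keyId: event.keyId, failures: 0, total: 0 };
      byKey[event.keyId] = stat;
      stats.push(stat);
    }
    stat.total += 1;
    if (event.status === 'failure') stat.failures += 1;
  }

  const rate = (s: KeyStat) => (s.failures * 100) / s.total;
  return stats
    .filter((s) => s.total >= minOperations && rate(s) >= failureThresholdPercent)
    .sort((a, b) => rate(b) - rate(a) || b.total - a.total)
    .map((s) => s.keyId);
}
```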

Monitoring Workflow

1. Query metrics summary

Retrieve metrics for the desired timeframe:
query {
  getEncryptionMetricsSummary(timeframeHours: 24)
}

2. Analyze failure rates

Check the failureRates array to identify operations with high failure rates.

3. Review top failures

Examine topFailures to understand common error patterns.

4. Identify problematic keys

Use the monitoring service to find keys that need rotation (programmatically).

5. Take action

  • Rotate keys with high failure rates
  • Investigate systemic issues
  • Alert on threshold breaches

Metrics Database Schema

model EncryptionMetric {
  id           String   @id @default(uuid())
  keyId        String?  @map("key_id")
  deviceId     String?  @map("device_id")
  operation    String   // 'encrypt' | 'decrypt' | 'generate'
  status       String   // 'success' | 'failure'
  errorReason  String?  @map("error_reason")
  duration     Int      // Operation duration in milliseconds
  timestamp    DateTime @default(now())

  @@index([timestamp])
  @@index([keyId])
  @@index([operation, status])
  @@map("EncryptionMetric")
}

Example Monitoring Dashboard

async function fetchMetrics(hours: number = 24) {
  const response = await fetch('/graphql', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      query: `
        query GetMetrics($hours: Float!) {
          getEncryptionMetricsSummary(timeframeHours: $hours)
        }
      `,
      variables: { hours }
    })
  });

  const { data } = await response.json();
  const metrics = JSON.parse(data.getEncryptionMetricsSummary);
  
  return metrics;
}

// Usage
const metrics = await fetchMetrics(24);
console.log('Metrics:', metrics.metrics);
console.log('Failure Rates:', metrics.failureRates);
console.log('Top Failures:', metrics.topFailures);
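
A GraphQL response can carry an errors array instead of (or alongside) data, so a dashboard is more robust if it checks for that before parsing. The helper below is a sketch under that assumption; extractSummary and GraphQLResponse are hypothetical names and are not part of the KMS client.

```typescript
interface GraphQLResponse {
  data?: { getEncryptionMetricsSummary?: string | null } | null;
  errors?: { message: string }[];
}

// Validates the GraphQL response body and returns the parsed summary.
// Throws if the server reported errors or the summary field is missing.
function extractSummary(body: GraphQLResponse): unknown {
  if (body.errors && body.errors.length > 0) {
    const messages = body.errors.map((e) => e.message).join('; ');
    throw new Error(`GraphQL error: ${messages}`);
  }
  const payload = body.data?.getEncryptionMetricsSummary;
  if (typeof payload !== 'string') {
    throw new Error('Response did not contain a metrics summary');
  }
  return JSON.parse(payload);
}
```

In fetchMetrics, this would take the place of the direct JSON.parse call on the response data.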

Alerting Strategies

Failure Rate Alerts

// sendAlert is application-specific and is not defined in this guide
async function checkFailureRates() {
  const metrics = await fetchMetrics(1); // Last hour
  
  for (const rate of metrics.failureRates) {
    // Check the higher threshold first so each operation raises at most one alert
    if (rate.failure_rate > 10) {
      await sendAlert({
        severity: 'critical',
        message: `Critical failure rate for ${rate.operation}: ${rate.failure_rate.toFixed(2)}%`,
        timeframe: '1 hour'
      });
    } else if (rate.failure_rate > 5) {
      await sendAlert({
        severity: 'warning',
        message: `High failure rate for ${rate.operation}: ${rate.failure_rate.toFixed(2)}%`,
        timeframe: '1 hour'
      });
    }
  }
}

Performance Degradation Alerts

async function checkPerformance() {
  const metrics = await fetchMetrics(1);
  
  for (const metric of metrics.metrics) {
    if (metric.operation === 'decrypt' && metric.avg_duration > 50) {
      await sendAlert({
        severity: 'warning',
        message: `Slow decryption performance: ${metric.avg_duration.toFixed(2)}ms average`,
        timeframe: '1 hour'
      });
    }
  }
}

Best Practices

Regular Monitoring

Query metrics at regular intervals (e.g., every 5 minutes) to detect issues early.

Failure Thresholds

Set appropriate thresholds for alerts (e.g., >5% warning, >10% critical).
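
Keeping the thresholds in one configurable place makes them easier to tune and test. A minimal sketch, assuming the example values of 5% and 10%; Severity, AlertThresholds, and classifyFailureRate are hypothetical names.

```typescript
type Severity = 'ok' | 'warning' | 'critical';

interface AlertThresholds {
  warningPercent: number;
  criticalPercent: number;
}

// Example values from this guide; adjust per environment.
const DEFAULT_THRESHOLDS: AlertThresholds = {
  warningPercent: 5,
  criticalPercent: 10,
};

// Maps a failure rate (in percent) to a single severity level.
// Comparisons are strict, so a rate exactly at a threshold stays
// in the lower band.
function classifyFailureRate(
  failureRatePercent: number,
  thresholds: AlertThresholds = DEFAULT_THRESHOLDS,
): Severity {
  if (failureRatePercent > thresholds.criticalPercent) return 'critical';
  if (failureRatePercent > thresholds.warningPercent) return 'warning';
  return 'ok';
}
```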

Key Rotation

Automatically rotate keys with consistently high failure rates.

Trend Analysis

Compare metrics across different timeframes to identify trends.
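
One way to do this is to fetch two summaries (for example, the last hour and the last 24 hours) and compare their failureRates arrays. The sketch below assumes the failure_rate field shape from the sample response; FailureRateRow, Trend, and compareFailureRates are hypothetical names.

```typescript
interface FailureRateRow {
  operation: string;
  failure_rate: number;
}

interface Trend {
  operation: string;
  recent: number;
  baseline: number;
  delta: number; // positive means the recent window is worse
}

// Compares failure rates from a recent window against a longer baseline.
// Operations absent from the baseline are treated as having a 0% rate.
function compareFailureRates(
  recent: FailureRateRow[],
  baseline: FailureRateRow[],
): Trend[] {
  const baselineByOperation: Record<string, number | undefined> = {};
  for (const row of baseline) {
    baselineByOperation[row.operation] = Number(row.failure_rate);
  }
  return recent.map((row) => {
    const recentRate = Number(row.failure_rate);
    const baselineRate = baselineByOperation[row.operation] ?? 0;
    return {
      operation: row.operation,
      recent: recentRate,
      baseline: baselineRate,
      delta: recentRate - baselineRate,
    };
  });
}
```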

Common Failure Reasons

| Error Reason | Description | Action |
| --- | --- | --- |
| Encryption key has expired | Key exceeded its 30-day lifetime | Rotate the key |
| Encryption key not found | Key was deleted or never existed | Generate new key |
| key_retrieval_failed | Database error retrieving key | Check database connectivity |
| Failed to decrypt data | Corrupted data or wrong key | Verify keyId matches |
| Encrypted data size exceeds limit | Payload too large | Reduce data size |
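
When processing topFailures programmatically, the table above can be encoded as a lookup. This sketch assumes the recorded reason strings match the table exactly, which may not hold if error messages carry extra detail; RECOMMENDED_ACTIONS, recommendedAction, and the fallback text are hypothetical.

```typescript
// Reason strings and actions copied from the table above.
const RECOMMENDED_ACTIONS: Record<string, string> = {
  'Encryption key has expired': 'Rotate the key',
  'Encryption key not found': 'Generate new key',
  'key_retrieval_failed': 'Check database connectivity',
  'Failed to decrypt data': 'Verify keyId matches',
  'Encrypted data size exceeds limit': 'Reduce data size',
};

// Returns the suggested action for a failure reason, or a generic
// fallback (an assumption of this sketch) for unknown reasons.
function recommendedAction(reason: string): string {
  return Object.prototype.hasOwnProperty.call(RECOMMENDED_ACTIONS, reason)
    ? RECOMMENDED_ACTIONS[reason]
    : 'Investigate manually';
}
```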

Performance Metrics

Normal Operation Ranges

  • Key Generation: 30-100ms
  • Encryption (client-side): 1-5ms
  • Decryption (server-side): 5-20ms
If averages exceed these ranges, investigate:
  • Database performance
  • Server CPU usage
  • Network latency
  • Large payload sizes
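
The ranges above can be checked against the metrics array from the summary. This is a sketch under two assumptions: the operation names map to the ranges as generate, encrypt, and decrypt, and only rows with status success are considered, since failed operations may return early. NORMAL_MAX_MS and operationsAboveNormalRange are hypothetical names.

```typescript
interface DurationRow {
  operation: string;
  status: string;
  avg_duration: number; // milliseconds
}

// Upper bounds of the normal ranges listed above, in milliseconds.
const NORMAL_MAX_MS: Record<string, number | undefined> = {
  generate: 100,
  encrypt: 5,
  decrypt: 20,
};

// Returns the successful-operation rows whose average duration exceeds
// the upper bound of the normal range. Unknown operations are ignored.
function operationsAboveNormalRange(rows: DurationRow[]): DurationRow[] {
  return rows.filter((row) => {
    const max = NORMAL_MAX_MS[row.operation];
    return (
      row.status === 'success' &&
      max !== undefined &&
      Number(row.avg_duration) > max
    );
  });
}
```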

Metrics Retention

Consider implementing a retention policy for metrics:
// Example: Delete metrics older than 90 days
async function cleanupOldMetrics() {
  const cutoffDate = new Date();
  cutoffDate.setDate(cutoffDate.getDate() - 90);

  await prismaService.encryptionMetric.deleteMany({
    where: {
      timestamp: {
        lt: cutoffDate
      }
    }
  });
}

Next Steps

Key Generation

Learn about key generation and rotation

Authentication

Implement secure authentication workflows
