Overview

Fluxer uses SigNoz as its observability platform, providing unified traces, metrics, and logs in a single interface. Built on OpenTelemetry standards, it offers:
  • Distributed tracing - Track requests across microservices
  • Metrics collection - Monitor performance and resource usage
  • Log aggregation - Centralized logging with structured search
  • Custom dashboards - Build visualizations for key metrics
  • Alerting - Proactive notifications for anomalies
SigNoz is self-hosted, giving you complete control over telemetry data without third-party dependencies.

Architecture

SigNoz Stack Components

1. OpenTelemetry Collector

Receives telemetry data via OTLP (gRPC and HTTP) and forwards to ClickHouse.
  • Ports: 4317 (gRPC), 4318 (HTTP)
  • Replicas: 3 for high availability
  • Batch processing: 10k events per batch
2. ClickHouse

High-performance columnar database for storing traces, metrics, and logs.
  • Version: 25.5.6
  • Retention: Configurable per data type
  • Compression: Optimized for telemetry data
3. SigNoz UI

Web interface for querying and visualizing telemetry data.
  • Port: 8080
  • URL: signoz.fluxer.app (behind Caddy)
  • Authentication: Built-in user management
4. ZooKeeper

Coordination service for ClickHouse cluster management.
  • Version: 3.7.1
  • Replicas: 1 (increase for production)

Deployment

Docker Swarm Stack

cd fluxer_devops/signoz

# Deploy with default settings
./deploy.sh

# Or specify version
export SIGNOZ_IMAGE_TAG=v0.108.0
export OTELCOL_TAG=v0.129.12
docker stack deploy -c compose.yaml fluxer-signoz

Environment Variables

.env
SIGNOZ_IMAGE_TAG=v0.108.0
OTELCOL_TAG=v0.129.12
LOW_CARDINAL_EXCEPTION_GROUPING=false

OpenTelemetry Collector Configuration

The OTel Collector processes and exports telemetry data:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  
  prometheus:
    config:
      global:
        scrape_interval: 60s
      scrape_configs:
        - job_name: otel-collector
          static_configs:
            - targets:
                - localhost:8888

Instrumenting Fluxer Services

Node.js (fluxer_server, fluxer_api, fluxer_gateway)

Install OpenTelemetry packages:
pnpm add @opentelemetry/sdk-node \
         @opentelemetry/sdk-metrics \
         @opentelemetry/auto-instrumentations-node \
         @opentelemetry/exporter-trace-otlp-grpc \
         @opentelemetry/exporter-metrics-otlp-grpc \
         @opentelemetry/resources \
         @opentelemetry/semantic-conventions
Create instrumentation file:
instrumentation.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'fluxer-server',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.VERSION || 'dev',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4317',
    }),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
Load the SDK before the application entry point (referencing the compiled instrumentation.js):
package.json
{
  "scripts": {
    "start": "node --require ./instrumentation.js dist/index.js"
  }
}
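On shutdown, flush the SDK so the final batch of spans and metrics is not lost. A minimal helper sketch: the function itself is generic, and the wiring comment assumes `sdk` is the NodeSDK instance exported from instrumentation.ts above (the 5-second timeout is an arbitrary default):

```typescript
// Await a flush callback, but give up after a timeout so shutdown cannot hang.
// Pass () => sdk.shutdown() for the NodeSDK instance created above.
export async function shutdownWithTimeout(
  flush: () => Promise<void>,
  timeoutMs = 5000,
): Promise<boolean> {
  let timer!: ReturnType<typeof setTimeout>;
  const timedOut = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  // Resolve true on a clean flush, false if the exporter errors out.
  const flushed = flush().then(
    () => true,
    () => false,
  );
  const result = await Promise.race([flushed, timedOut]);
  clearTimeout(timer);
  return result;
}

// Typical wiring:
// process.on('SIGTERM', async () => {
//   await shutdownWithTimeout(() => sdk.shutdown());
//   process.exit(0);
// });
```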

Custom Spans

Add manual instrumentation for critical operations:
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('fluxer-server');

async function sendMessage(channelId: string, content: string) {
  const span = tracer.startSpan('sendMessage');
  span.setAttributes({
    'channel.id': channelId,
    'message.length': content.length,
  });
  
  try {
    // Validate content
    await validateMessage(content);
    
    // Save to Cassandra
    const message = await db.messages.insert({
      channelId,
      content,
      timestamp: Date.now(),
    });
    
    // Publish to NATS
    await nats.publish(`channel.${channelId}.message`, message);
    
    span.setStatus({ code: SpanStatusCode.OK });
    return message;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
}

Custom Metrics

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('fluxer-server');

const messageCounter = meter.createCounter('fluxer.messages.sent', {
  description: 'Total messages sent',
  unit: '1',
});

const messageHistogram = meter.createHistogram('fluxer.message.size', {
  description: 'Message size distribution',
  unit: 'bytes',
});

// Record metrics
messageCounter.add(1, {
  'channel.type': 'text',
  'guild.id': guildId,
});

messageHistogram.record(content.length, {
  'channel.type': 'text',
});
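The dashboard examples later on this page query an active-connection gauge; here is a sketch of how the gateway could feed one. The metric name `fluxer.websocket.connections` and the tracker class are assumptions, and the OpenTelemetry wiring is left in a comment so the counter logic stands alone:

```typescript
// Plain connection counter; an ObservableGauge callback reads it on each export.
export class ConnectionTracker {
  private active = 0;

  onConnect(): void {
    this.active += 1;
  }

  onDisconnect(): void {
    // Clamp at zero in case a disconnect fires twice for one socket.
    this.active = Math.max(0, this.active - 1);
  }

  read(): number {
    return this.active;
  }
}

// With @opentelemetry/api available (names assumed):
// const gauge = meter.createObservableGauge('fluxer.websocket.connections', { unit: '1' });
// const tracker = new ConnectionTracker();
// gauge.addCallback((result) => result.observe(tracker.read()));
```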

Prometheus Integration

SigNoz can scrape Prometheus metrics from instrumented services:

Service Discovery

Configure Docker Swarm labels for automatic scraping:
services:
  fluxer_api:
    deploy:
      labels:
        signoz.io/scrape: 'true'
        signoz.io/port: '9464'
        signoz.io/path: '/metrics'

Metrics Endpoint

Expose Prometheus metrics in your service:
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { MeterProvider } from '@opentelemetry/sdk-metrics';
import { metrics } from '@opentelemetry/api';
import express from 'express';

// The exporter runs its own HTTP server on port 9464 and serves /metrics.
const prometheusExporter = new PrometheusExporter({
  port: 9464,
  endpoint: '/metrics',
});

// Register the exporter as a metric reader and set the provider globally so
// meters obtained via the API actually export through it.
const meterProvider = new MeterProvider({ readers: [prometheusExporter] });
metrics.setGlobalMeterProvider(meterProvider);

const app = express();
// ... app routes

app.listen(8080);

Dashboards

Built-in Dashboards

SigNoz includes pre-built dashboards for:
  • APM - Application performance overview
  • Infrastructure - CPU, memory, disk, network
  • Database - Query performance and connection pools
  • Errors - Exception tracking and error rates

Custom Dashboards

Create custom dashboards for Fluxer-specific metrics:
1. Navigate to Dashboards

Open SigNoz UI → Dashboards → New Dashboard
2. Add Panels

Select visualization type:
  • Time series (line/area charts)
  • Bar charts
  • Pie charts
  • Value (single number)
  • Table
3. Configure Query

Use PromQL-like queries:
# Message send rate
rate(fluxer_messages_sent_total[5m])

# P95 message send latency
histogram_quantile(0.95, sum by (le) (rate(fluxer_message_send_duration_bucket[5m])))

# Active WebSocket connections
sum(fluxer_websocket_connections)
4. Save Dashboard

Export as JSON for version control:
cp dashboard.json fluxer_devops/signoz/dashboards/

Alerting

Alert Rules

Create alerts for critical conditions:
name: High Error Rate
query: rate(fluxer_errors_total[5m]) > 10
severity: critical

annotations:
  description: 'Error rate is {{ $value }}/s (threshold: 10/s)'

labels:
  service: fluxer-server
  team: backend

Notification Channels

Configure alerts to send notifications:
  • Slack - Post to #alerts channel
  • Email - Send to on-call team
  • Webhook - Integrate with PagerDuty, Opsgenie
  • Discord - Fluxer meta dogfooding!
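For the webhook channel, the receiving service needs to parse the alert payload. The shape below (an `alerts` array with `status`, `labels`, and `annotations`) follows the Alertmanager-compatible format that webhook integrations commonly assume; verify the exact field names against your SigNoz version before relying on them:

```typescript
// Hypothetical payload types for a webhook alert notification.
interface WebhookAlert {
  status: 'firing' | 'resolved';
  labels: Record<string, string>;
  annotations: Record<string, string>;
}

interface WebhookPayload {
  alerts: WebhookAlert[];
}

// Turn a payload into one human-readable line per alert,
// e.g. for forwarding to a chat channel.
export function formatAlerts(payload: WebhookPayload): string[] {
  return payload.alerts.map((a) => {
    const name = a.labels.alertname ?? 'unknown alert';
    const desc = a.annotations.description ?? '';
    return `[${a.status.toUpperCase()}] ${name}: ${desc}`;
  });
}
```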

Log Aggregation

Centralize logs from all Fluxer services:

Structured Logging

import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  transport: {
    target: 'pino-opentelemetry-transport',
    options: {
      exporterUrl: 'http://otel-collector:4318/v1/logs',
    },
  },
});

logger.info({ userId, channelId }, 'User sent message');
logger.error({ error: err, userId }, 'Failed to send message');

Log Querying

Search logs in SigNoz UI:
-- Find all errors for user
service_name = 'fluxer-server' AND level = 'error' AND userId = '123456'

-- Find slow database queries
service_name = 'fluxer-server' AND attributes.db.duration > 1000

-- Find specific error messages
service_name = 'fluxer-api' AND body CONTAINS 'rate limit exceeded'

Performance Optimization

Data Retention

Configure retention policies to manage storage:
ClickHouse
-- Set 30-day retention for traces
ALTER TABLE signoz_traces.signoz_index_v2 
MODIFY TTL timestamp + INTERVAL 30 DAY;

-- Set 90-day retention for metrics
ALTER TABLE signoz_metrics.samples_v2 
MODIFY TTL timestamp + INTERVAL 90 DAY;

-- Set 7-day retention for logs
ALTER TABLE signoz_logs.logs 
MODIFY TTL timestamp + INTERVAL 7 DAY;

Sampling

Reduce trace volume with tail-based sampling:
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 1000
      
      - name: sample-others
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
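Before rolling out a sampling config, it helps to estimate the retained trace volume. A rough model, assuming error and slow traces are disjoint sets that are always kept while the remainder is sampled probabilistically (all numbers hypothetical):

```typescript
// Estimate traces retained per second under tail-based sampling.
// total: incoming traces/s; errors, slow: always-kept traces/s (assumed disjoint);
// pct: probabilistic sampling percentage applied to the rest.
export function retainedTracesPerSec(
  total: number,
  errors: number,
  slow: number,
  pct: number,
): number {
  const alwaysKept = errors + slow;
  return alwaysKept + (total - alwaysKept) * (pct / 100);
}
```

At 1000 traces/s with 10 errors/s and 40 slow/s, the 10% policy above keeps roughly 145 traces/s.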

Troubleshooting

Check OTel Collector logs:
docker service logs fluxer-signoz_otel-collector -f
Verify connectivity:
# From application container
curl -v http://otel-collector:4318/v1/traces
Check firewall rules:
# Ensure ports 4317 and 4318 are accessible
Check query performance:
SELECT query, elapsed 
FROM system.processes 
ORDER BY elapsed DESC;
Optimize with materialized views:
CREATE MATERIALIZED VIEW signoz_traces.top_endpoints
ENGINE = SummingMergeTree()
ORDER BY (service_name, http_route, timestamp)
AS SELECT
  service_name,
  http_route,
  toStartOfMinute(timestamp) as timestamp,
  count() as count
FROM signoz_traces.signoz_index_v2
GROUP BY service_name, http_route, timestamp;
Clock skew - Ensure NTP is configured:
timedatectl set-ntp true
Context propagation - Verify trace context headers:
import { context, propagation } from '@opentelemetry/api';

const headers = {};
propagation.inject(context.active(), headers);
// Include headers in downstream requests

Best Practices

Use Consistent Attributes

Define standard attribute names across services:
  • user.id
  • guild.id
  • channel.id
  • message.id
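One lightweight way to enforce this is to centralize the keys in a shared module so services cannot drift apart. A sketch; the module layout and helper name are hypothetical:

```typescript
// Shared attribute keys for spans, metrics, and logs across Fluxer services.
export const Attr = {
  USER_ID: 'user.id',
  GUILD_ID: 'guild.id',
  CHANNEL_ID: 'channel.id',
  MESSAGE_ID: 'message.id',
} as const;

// Build an attribute object from known IDs, dropping undefined values so
// exporters never receive empty attributes.
export function fluxerAttributes(ids: {
  userId?: string;
  guildId?: string;
  channelId?: string;
  messageId?: string;
}): Record<string, string> {
  const out: Record<string, string> = {};
  if (ids.userId) out[Attr.USER_ID] = ids.userId;
  if (ids.guildId) out[Attr.GUILD_ID] = ids.guildId;
  if (ids.channelId) out[Attr.CHANNEL_ID] = ids.channelId;
  if (ids.messageId) out[Attr.MESSAGE_ID] = ids.messageId;
  return out;
}

// Usage: span.setAttributes(fluxerAttributes({ userId, channelId }));
```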

Sample High-Volume Traces

Use tail-based sampling to keep errors and slow requests while sampling normal traffic.

Set Alert Thresholds

Base thresholds on historical data and percentiles, not absolute values.
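For example, a threshold can be derived from a percentile of historical samples plus some headroom, rather than picked by hand. A sketch; the 1.5x headroom multiplier is an assumption to tune per metric:

```typescript
// Nearest-rank percentile of a sample set (p in 0..100).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

// Alert threshold = p-th percentile of historical data times a headroom factor,
// so normal variation does not page anyone.
export function alertThreshold(samples: number[], p = 95, headroom = 1.5): number {
  return percentile(samples, p) * headroom;
}
```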

Monitor the Monitor

Set up alerts for SigNoz/ClickHouse health and resource usage.
