Overview

Fluxer uses SigNoz as its observability platform, providing unified traces, metrics, and logs in a single interface. Built on OpenTelemetry standards, it offers:
  • Distributed tracing - Track requests across microservices
  • Metrics collection - Monitor performance and resource usage
  • Log aggregation - Centralized logging with structured search
  • Custom dashboards - Build visualizations for key metrics
  • Alerting - Proactive notifications for anomalies
SigNoz is self-hosted, giving you complete control over telemetry data without third-party dependencies.

Architecture

SigNoz Stack Components

1. OpenTelemetry Collector

Receives telemetry data via OTLP (gRPC and HTTP) and forwards to ClickHouse.
  • Ports: 4317 (gRPC), 4318 (HTTP)
  • Replicas: 3 for high availability
  • Batch processing: 10k events per batch
2. ClickHouse

High-performance columnar database for storing traces, metrics, and logs.
  • Version: 25.5.6
  • Retention: Configurable per data type
  • Compression: Optimized for telemetry data
3. SigNoz UI

Web interface for querying and visualizing telemetry data.
  • Port: 8080
  • URL: signoz.fluxer.app (behind Caddy)
  • Authentication: Built-in user management
4. ZooKeeper

Coordination service for ClickHouse cluster management.
  • Version: 3.7.1
  • Replicas: 1 (increase for production)

Deployment

Docker Swarm Stack

cd fluxer_devops/signoz

# Deploy with default settings
./deploy.sh

# Or specify version
export SIGNOZ_IMAGE_TAG=v0.108.0
export OTELCOL_TAG=v0.129.12
docker stack deploy -c compose.yaml fluxer-signoz

Environment Variables

.env
SIGNOZ_IMAGE_TAG=v0.108.0
OTELCOL_TAG=v0.129.12
LOW_CARDINAL_EXCEPTION_GROUPING=false

OpenTelemetry Collector Configuration

The OTel Collector processes and exports telemetry data:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  
  prometheus:
    config:
      global:
        scrape_interval: 60s
      scrape_configs:
        - job_name: otel-collector
          static_configs:
            - targets:
                - localhost:8888

Instrumenting Fluxer Services

Node.js (fluxer_server, fluxer_api, fluxer_gateway)

Install OpenTelemetry packages:
pnpm add @opentelemetry/sdk-node \
         @opentelemetry/sdk-metrics \
         @opentelemetry/auto-instrumentations-node \
         @opentelemetry/exporter-trace-otlp-grpc \
         @opentelemetry/exporter-metrics-otlp-grpc \
         @opentelemetry/resources \
         @opentelemetry/semantic-conventions
Create instrumentation file:
instrumentation.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'fluxer-server',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.VERSION || 'dev',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'production',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4317',
    }),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
Load the SDK before the application entry point (referencing the compiled instrumentation.js):
package.json
{
  "scripts": {
    "start": "node --require ./instrumentation.js dist/index.js"
  }
}
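On shutdown, flush the SDK so the final batch of spans and metrics is not lost. A minimal helper sketch: the function itself is generic, and the wiring comment assumes `sdk` is the NodeSDK instance exported from instrumentation.ts above (the 5-second timeout is an arbitrary default):

```typescript
// Await a flush callback, but give up after a timeout so shutdown cannot hang.
// Pass () => sdk.shutdown() for the NodeSDK instance created above.
export async function shutdownWithTimeout(
  flush: () => Promise<void>,
  timeoutMs = 5000,
): Promise<boolean> {
  let timer!: ReturnType<typeof setTimeout>;
  const timedOut = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });
  // Resolve true on a clean flush, false if the exporter errors out.
  const flushed = flush().then(
    () => true,
    () => false,
  );
  const result = await Promise.race([flushed, timedOut]);
  clearTimeout(timer);
  return result;
}

// Typical wiring:
// process.on('SIGTERM', async () => {
//   await shutdownWithTimeout(() => sdk.shutdown());
//   process.exit(0);
// });
```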

Custom Spans

Add manual instrumentation for critical operations:
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('fluxer-server');

async function sendMessage(channelId: string, content: string) {
  const span = tracer.startSpan('sendMessage');
  span.setAttributes({
    'channel.id': channelId,
    'message.length': content.length,
  });
  
  try {
    // Validate content
    await validateMessage(content);
    
    // Save to Cassandra
    const message = await db.messages.insert({
      channelId,
      content,
      timestamp: Date.now(),
    });
    
    // Publish to NATS
    await nats.publish(`channel.${channelId}.message`, message);
    
    span.setStatus({ code: SpanStatusCode.OK });
    return message;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
}

Custom Metrics

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('fluxer-server');

const messageCounter = meter.createCounter('fluxer.messages.sent', {
  description: 'Total messages sent',
  unit: '1',
});

const messageHistogram = meter.createHistogram('fluxer.message.size', {
  description: 'Message size distribution',
  unit: 'bytes',
});

// Record metrics
messageCounter.add(1, {
  'channel.type': 'text',
  'guild.id': guildId,
});

messageHistogram.record(content.length, {
  'channel.type': 'text',
});
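The dashboard examples later on this page query an active-connection gauge; here is a sketch of how the gateway could feed one. The metric name `fluxer.websocket.connections` and the tracker class are assumptions, and the OpenTelemetry wiring is left in a comment so the counter logic stands alone:

```typescript
// Plain connection counter; an ObservableGauge callback reads it on each export.
export class ConnectionTracker {
  private active = 0;

  onConnect(): void {
    this.active += 1;
  }

  onDisconnect(): void {
    // Clamp at zero in case a disconnect fires twice for one socket.
    this.active = Math.max(0, this.active - 1);
  }

  read(): number {
    return this.active;
  }
}

// With @opentelemetry/api available (names assumed):
// const gauge = meter.createObservableGauge('fluxer.websocket.connections', { unit: '1' });
// const tracker = new ConnectionTracker();
// gauge.addCallback((result) => result.observe(tracker.read()));
```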

Prometheus Integration

SigNoz can scrape Prometheus metrics from instrumented services:

Service Discovery

Configure Docker Swarm labels for automatic scraping:
services:
  fluxer_api:
    deploy:
      labels:
        signoz.io/scrape: 'true'
        signoz.io/port: '9464'
        signoz.io/path: '/metrics'

Metrics Endpoint

Expose Prometheus metrics in your service:
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { MeterProvider } from '@opentelemetry/sdk-metrics';
import { metrics } from '@opentelemetry/api';
import express from 'express';

// The exporter runs its own HTTP server on port 9464 and serves /metrics.
const prometheusExporter = new PrometheusExporter({
  port: 9464,
  endpoint: '/metrics',
});

// Register the exporter as a metric reader and set the provider globally so
// meters obtained via the API actually export through it.
const meterProvider = new MeterProvider({ readers: [prometheusExporter] });
metrics.setGlobalMeterProvider(meterProvider);

const app = express();
// ... app routes

app.listen(8080);

Dashboards

Built-in Dashboards

SigNoz includes pre-built dashboards for:
  • APM - Application performance overview
  • Infrastructure - CPU, memory, disk, network
  • Database - Query performance and connection pools
  • Errors - Exception tracking and error rates

Custom Dashboards

Create custom dashboards for Fluxer-specific metrics:
1. Navigate to Dashboards

Open SigNoz UI → Dashboards → New Dashboard
2. Add Panels

Select visualization type:
  • Time series (line/area charts)
  • Bar charts
  • Pie charts
  • Value (single number)
  • Table
3. Configure Query

Use PromQL-like queries:
# Message send rate
rate(fluxer_messages_sent_total[5m])

# P95 message send latency
histogram_quantile(0.95, sum by (le) (rate(fluxer_message_send_duration_bucket[5m])))

# Active WebSocket connections
sum(fluxer_websocket_connections)
4. Save Dashboard

Export as JSON for version control:
cp dashboard.json fluxer_devops/signoz/dashboards/

Alerting

Alert Rules

Create alerts for critical conditions:
name: High Error Rate
query: rate(fluxer_errors_total[5m]) > 10
severity: critical

annotations:
  description: 'Error rate is {{ $value }}/s (threshold: 10/s)'

labels:
  service: fluxer-server
  team: backend

Notification Channels

Configure alerts to send notifications:
  • Slack - Post to #alerts channel
  • Email - Send to on-call team
  • Webhook - Integrate with PagerDuty, Opsgenie
  • Discord - Fluxer meta dogfooding!
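For the webhook channel, the receiving service needs to parse the alert payload. The shape below (an `alerts` array with `status`, `labels`, and `annotations`) follows the Alertmanager-compatible format that webhook integrations commonly assume; verify the exact field names against your SigNoz version before relying on them:

```typescript
// Hypothetical payload types for a webhook alert notification.
interface WebhookAlert {
  status: 'firing' | 'resolved';
  labels: Record<string, string>;
  annotations: Record<string, string>;
}

interface WebhookPayload {
  alerts: WebhookAlert[];
}

// Turn a payload into one human-readable line per alert,
// e.g. for forwarding to a chat channel.
export function formatAlerts(payload: WebhookPayload): string[] {
  return payload.alerts.map((a) => {
    const name = a.labels.alertname ?? 'unknown alert';
    const desc = a.annotations.description ?? '';
    return `[${a.status.toUpperCase()}] ${name}: ${desc}`;
  });
}
```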

Log Aggregation

Centralize logs from all Fluxer services:

Structured Logging

import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  transport: {
    target: 'pino-opentelemetry-transport',
    options: {
      exporterUrl: 'http://otel-collector:4318/v1/logs',
    },
  },
});

logger.info({ userId, channelId }, 'User sent message');
logger.error({ error: err, userId }, 'Failed to send message');

Log Querying

Search logs in SigNoz UI:
-- Find all errors for user
service_name = 'fluxer-server' AND level = 'error' AND userId = '123456'

-- Find slow database queries
service_name = 'fluxer-server' AND attributes.db.duration > 1000

-- Find specific error messages
service_name = 'fluxer-api' AND body CONTAINS 'rate limit exceeded'

Performance Optimization

Data Retention

Configure retention policies to manage storage:
ClickHouse
-- Set 30-day retention for traces
ALTER TABLE signoz_traces.signoz_index_v2 
MODIFY TTL timestamp + INTERVAL 30 DAY;

-- Set 90-day retention for metrics
ALTER TABLE signoz_metrics.samples_v2 
MODIFY TTL timestamp + INTERVAL 90 DAY;

-- Set 7-day retention for logs
ALTER TABLE signoz_logs.logs 
MODIFY TTL timestamp + INTERVAL 7 DAY;

Sampling

Reduce trace volume with tail-based sampling:
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 1000
      
      - name: sample-others
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
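Before rolling out a sampling config, it helps to estimate the retained trace volume. A rough model, assuming error and slow traces are disjoint sets that are always kept while the remainder is sampled probabilistically (all numbers hypothetical):

```typescript
// Estimate traces retained per second under tail-based sampling.
// total: incoming traces/s; errors, slow: always-kept traces/s (assumed disjoint);
// pct: probabilistic sampling percentage applied to the rest.
export function retainedTracesPerSec(
  total: number,
  errors: number,
  slow: number,
  pct: number,
): number {
  const alwaysKept = errors + slow;
  return alwaysKept + (total - alwaysKept) * (pct / 100);
}
```

At 1000 traces/s with 10 errors/s and 40 slow/s, the 10% policy above keeps roughly 145 traces/s.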

Troubleshooting

Check OTel Collector logs:
docker service logs fluxer-signoz_otel-collector -f
Verify connectivity:
# From application container
curl -v http://otel-collector:4318/v1/traces
Check firewall rules:
# Ensure ports 4317 and 4318 are accessible
Check query performance:
SELECT query, elapsed 
FROM system.processes 
ORDER BY elapsed DESC;
Optimize with materialized views:
CREATE MATERIALIZED VIEW signoz_traces.top_endpoints
ENGINE = SummingMergeTree()
ORDER BY (service_name, http_route, timestamp)
AS SELECT
  service_name,
  http_route,
  toStartOfMinute(timestamp) as timestamp,
  count() as count
FROM signoz_traces.signoz_index_v2
GROUP BY service_name, http_route, timestamp;
Clock skew - Ensure NTP is configured:
timedatectl set-ntp true
Context propagation - Verify trace context headers:
import { context, propagation } from '@opentelemetry/api';

const headers = {};
propagation.inject(context.active(), headers);
// Include headers in downstream requests

Best Practices

Use Consistent Attributes

Define standard attribute names across services:
  • user.id
  • guild.id
  • channel.id
  • message.id
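One lightweight way to enforce this is to centralize the keys in a shared module so services cannot drift apart. A sketch; the module layout and helper name are hypothetical:

```typescript
// Shared attribute keys for spans, metrics, and logs across Fluxer services.
export const Attr = {
  USER_ID: 'user.id',
  GUILD_ID: 'guild.id',
  CHANNEL_ID: 'channel.id',
  MESSAGE_ID: 'message.id',
} as const;

// Build an attribute object from known IDs, dropping undefined values so
// exporters never receive empty attributes.
export function fluxerAttributes(ids: {
  userId?: string;
  guildId?: string;
  channelId?: string;
  messageId?: string;
}): Record<string, string> {
  const out: Record<string, string> = {};
  if (ids.userId) out[Attr.USER_ID] = ids.userId;
  if (ids.guildId) out[Attr.GUILD_ID] = ids.guildId;
  if (ids.channelId) out[Attr.CHANNEL_ID] = ids.channelId;
  if (ids.messageId) out[Attr.MESSAGE_ID] = ids.messageId;
  return out;
}

// Usage: span.setAttributes(fluxerAttributes({ userId, channelId }));
```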

Sample High-Volume Traces

Use tail-based sampling to keep errors and slow requests while sampling normal traffic.

Set Alert Thresholds

Base thresholds on historical data and percentiles, not absolute values.
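For example, a threshold can be derived from a percentile of historical samples plus some headroom, rather than picked by hand. A sketch; the 1.5x headroom multiplier is an assumption to tune per metric:

```typescript
// Nearest-rank percentile of a sample set (p in 0..100).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

// Alert threshold = p-th percentile of historical data times a headroom factor,
// so normal variation does not page anyone.
export function alertThreshold(samples: number[], p = 95, headroom = 1.5): number {
  return percentile(samples, p) * headroom;
}
```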

Monitor the Monitor

Set up alerts for SigNoz/ClickHouse health and resource usage.
