SigNoz
SigNoz is an open-source observability platform that provides distributed tracing, metrics, and logs in a unified interface. It’s a self-hosted alternative to DataDog or New Relic, built on OpenTelemetry standards.Why SigNoz?
SigNoz offers several advantages for ML applications:- Open-source: No vendor lock-in, full control over your data
- OpenTelemetry native: Works with standard instrumentation
- Unified platform: Traces, metrics, and logs in one place
- Cost-effective: Self-hosted means no per-seat or per-event pricing
- Query flexibility: Use ClickHouse for powerful analytics
SigNoz is built on ClickHouse, which provides excellent performance for storing and querying large volumes of trace data.
Architecture
SigNoz consists of several components:- OTEL Collector: Receives, processes, and exports telemetry data
- ClickHouse: Columnar database for storing traces and metrics
- Query Service: API for querying data from ClickHouse
- Frontend: React-based UI for visualization
Prerequisites
Before installing SigNoz, ensure you have:- A Kubernetes cluster (kind, minikube, or cloud-based)
kubectlconfigured to access your clusterhelm3.x installed- At least 4GB of RAM available for SigNoz components
Installation
Step 1: Enable Volume Expansion
SigNoz requires persistent storage with volume expansion enabled:Step 2: Add SigNoz Helm Repository
Step 3: Install SigNoz
- Creates PersistentVolumeClaims for ClickHouse and Kafka
- Deploys all components (collector, query service, frontend, etc.)
- Waits for all pods to be ready
Step 4: Verify Installation
Check that all pods are running:my-release-signoz-clickhousemy-release-signoz-otel-collectormy-release-signoz-query-servicemy-release-signoz-frontendmy-release-signoz-kafkamy-release-signoz-zookeeper
Accessing SigNoz
Port Forwarding
For local access, use kubectl port-forward:Remote Access
For remote access (e.g., from a development machine to a remote cluster):Configuring Applications
Environment Variables
Configure your application to send traces to SigNoz:Python Configuration
For Python applications using OpenLLMetry:Using the SigNoz UI
First Login
- Navigate to
http://localhost:3301 - Create an account (stored locally in ClickHouse)
- Complete the onboarding wizard
Viewing Traces
The Traces page shows all incoming traces:- Timeline view: See when requests occurred
- List view: Browse traces with filtering
- Trace detail: Click a trace to see the full span tree
Traces are organized by service name, which comes from the
OTEL_SERVICE_NAME environment variable.Key Features
Filtering and Search
Filtering and Search
Use the query builder to filter traces by:
- Service name
- HTTP status code
- Duration (e.g., slower than 1s)
- Custom attributes (e.g., user ID, model name)
- Tags and metadata
Span Details
Span Details
Click on any span to see:
- Attributes (key-value pairs)
- Events (timestamped logs within the span)
- Parent-child relationships
- Timing information
- Model name
- Token counts (prompt, completion, total)
- Temperature and other parameters
- Prompt and response content (if logged)
Service Map
Service Map
The service map shows:
- All services in your system
- Dependencies between services
- Request rates and error rates
- Latency percentiles
Metrics Dashboard
Metrics Dashboard
SigNoz automatically generates metrics from traces:
- Request rate (requests per second)
- Error rate (percentage of failed requests)
- Duration (p50, p90, p95, p99)
Example: Monitoring LLM Applications
Instrumenting Your App
What You’ll See in SigNoz
After running your application:- Service:
text2sqlappears in the services list - Traces: Each invocation creates a trace with spans for:
- The
generate_sqlworkflow - The OpenAI API call
- Network requests
- The
- Attributes: Token counts, model name, latency
- Metrics: Aggregate statistics over time
Advanced Configuration
Custom Values
Create avalues.yaml file to customize the installation:
Sampling Configuration
For high-volume applications, configure sampling:Troubleshooting
Pods not starting
Pods not starting
Check resource availability:Common issues:
- Insufficient memory or CPU
- PersistentVolume not provisioning
- Image pull errors
No traces appearing
No traces appearing
-
Verify port-forward is active:
-
Check application configuration:
TRACELOOP_BASE_URLis set correctly- Application can reach the collector
-
Check collector logs:
ClickHouse errors
ClickHouse errors
ClickHouse may run out of disk space. Check usage:Resize the PVC if needed:
Cleanup
To uninstall SigNoz:Best Practices
Use Service Names
Set descriptive
OTEL_SERVICE_NAME values to distinguish between services and environments.Monitor Resource Usage
ClickHouse and Kafka can consume significant resources. Monitor and adjust limits as needed.
Configure Retention
Set data retention policies to prevent unlimited growth. Default is 7 days.
Use Sampling in Production
Enable sampling for high-volume services to reduce storage costs and overhead.
Additional Resources
- SigNoz Documentation
- Python Instrumentation Guide
- Kubernetes Deployment Guide
- ClickHouse Configuration
Next Steps
Set Up Grafana
Configure Grafana and Prometheus for Kubernetes metrics monitoring