Overview
Penn Labs uses Datadog for centralized logging from Kubernetes applications. Logs are automatically collected from all containers and made available for search and analysis.

Logging Architecture
Datadog Agent
The Datadog agent runs as a DaemonSet on every Kubernetes node, collecting logs from all containers. It is deployed via Terraform (terraform/modules/base_cluster/datadog.tf), with Helm values in terraform/helm/datadog.yaml.

- Container Collection: Automatically collects logs from all containers
- Namespace: Deployed in the `monitoring` namespace
- Custom Image: Uses `pennlabs/datadog-agent` with custom configurations
- API Key: Stored in the `datadog` Kubernetes secret
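Log collection is typically switched on in the Helm values with settings like the following sketch (key names follow the upstream Datadog chart; the actual file may differ):

```yaml
datadog:
  logs:
    # Enable the agent's log collection feature
    enabled: true
    # Collect logs from every container, not just annotated ones
    containerCollectAll: true
```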
Log Collection
Automatic Collection
Datadog automatically collects logs from:

- All application pods
- System pods (kube-system)
- Init containers
- Sidecar containers
Log Sources
Application Logs: Anything the application writes to stdout/stderr.

Log Metadata
Datadog automatically adds metadata to logs:

- Pod name - Which pod generated the log
- Namespace - Kubernetes namespace
- Container name - Container within the pod
- Node - Which Kubernetes node
- Service - Inferred from pod labels
- Image - Docker image and tag
- Labels - All Kubernetes pod labels
Accessing Logs
Datadog UI
Primary Method: Use the Datadog web interface for log search and analysis.

Access:

- Log in to Datadog (credentials in Bitwarden)
- Navigate to Logs section
- Use the time picker to select a range
  - Common ranges: 15m, 1h, 4h, 1d, 7d
  - Custom ranges available
- Click “Live Tail” to stream logs in real time
  - Useful for debugging active issues
- Apply filters to narrow down logs
kubectl Logs
Quick Method: Use kubectl for immediate log access.

Limitations:

- Only shows recent logs (limited buffer)
- No search or filtering (use grep)
- Doesn’t persist after pod deletion
- No aggregation across pods
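Viewing pod logs with kubectl might look like the following (pod and deployment names are illustrative):

```shell
# Stream logs from a single pod
kubectl logs -f penn-clubs-abc123

# Stream logs from all pods behind a deployment
kubectl logs -f deployment/penn-clubs

# Recent logs only, filtered with grep
kubectl logs --since=15m penn-clubs-abc123 | grep -i error
```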
Log Best Practices
Application Logging
DO:

- ✓ Use structured logging (JSON format when possible)
- ✓ Include context (user ID, request ID, etc.)
- ✓ Use appropriate log levels (DEBUG, INFO, WARNING, ERROR)
- ✓ Log important events (user actions, errors, performance)
- ✓ Include timestamps (automatic in most frameworks)
- ✓ Sanitize sensitive data (passwords, tokens, PII)

DON’T:

- ✗ Log sensitive information (passwords, API keys, SSNs)
- ✗ Log too verbosely in production (debug spam)
- ✗ Log binary data or large payloads
- ✗ Use print() instead of logging framework
- ✗ Ignore errors or swallow exceptions silently
Structured Logging Example
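Python (Django): a minimal structured-logging sketch. The JSON formatter and field names below are assumptions for illustration, not the project's actual configuration:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with extra context."""
    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Pick up request context passed via `extra=` (names are illustrative)
        for key in ("user_id", "request_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()  # writes to stderr, where Datadog collects
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("clubs")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user logged in", extra={"user_id": 123, "request_id": "abc"})
```

Because every line is one JSON object, Datadog can parse the fields into facets automatically.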
Log Levels
Use appropriate severity levels:

DEBUG:

- Detailed information for diagnosing problems
- Only in development, not production
- Example: Variable values, function entry/exit

INFO:

- General informational messages
- Normal application flow
- Example: User logged in, job completed

WARNING:

- Unexpected but handled situations
- Doesn’t stop execution
- Example: Deprecated API used, retrying failed request

ERROR:

- Error that affects functionality
- May be recoverable
- Example: Database query failed, API timeout

CRITICAL:

- Severe error, application may crash
- Requires immediate attention
- Example: Database unreachable, out of memory
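In Python's logging module the levels above map directly onto logger methods (messages here are illustrative):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("example")

logger.debug("cache key computed")              # development-only detail; filtered out at INFO
logger.info("user logged in")                   # normal application flow
logger.warning("deprecated API used; retrying") # handled but unexpected
logger.error("database query failed")           # functionality affected
logger.critical("database unreachable")         # requires immediate attention
```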
Searching and Filtering
Datadog Search Queries
Basic Search: Combine facets such as service and status with free-text terms to narrow results.

Creating Log Views
Save common searches:

- Create a search query
- Click “Save View”
- Name the view (e.g., “Penn Clubs Errors”)
- Set default time range and columns
- Share with team if needed

Useful saved views:

- Application Errors (all services, status:error)
- Slow Requests (duration > 2s)
- User Login Events (contains “login”)
- Database Errors (contains “database” and status:error)
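The saved views above might be backed by queries like these (the service name penn-clubs is an assumption; the duration threshold follows the metric example below):

```
service:penn-clubs status:error      Application Errors for one service
@duration:>2000                      Slow Requests
*login*                              User Login Events
*database* status:error              Database Errors
```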
Monitoring and Alerts
Log-based Metrics
Create metrics from log patterns:

Example: Error Rate

- Go to Logs → Generate Metrics
- Filter: status:error
- Group by: service
- Metric name: logs.error.count
- Use in dashboards and alerts

Example: Slow Requests

- Filter: @duration:>2000
- Metric name: logs.slow_request.count
- Alert when count > threshold
Log-based Alerts
Create alerts for log patterns:

- Go to Monitors → New Monitor
- Select “Logs”
- Define the search query
- Set alert conditions:
  - Trigger when > 10 errors in 5 minutes
- Configure notification:
  - Send to Slack channel
  - Tag on-call engineer
- Add context:
  - Link to runbook
  - Include log samples
Datadog Configuration
Custom Checks
Datadog can monitor custom application metrics.

Cert Manager Check (terraform/helm/datadog.yaml):
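Such a check might be wired up through the chart's confd values, roughly like this hypothetical sketch (the endpoint, port, and key names are assumptions; consult the actual file):

```yaml
datadog:
  confd:
    cert_manager.yaml: |-
      instances:
        # cert-manager exposes Prometheus metrics on port 9402 by default
        - openmetrics_endpoint: http://cert-manager.cert-manager.svc:9402/metrics
```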
Environment Variables
The Datadog agent loads secrets from the `datadog` Kubernetes secret:

- DD_API_KEY - Datadog API key
- POSTGRES_HOST - Database host for monitoring
- POSTGRES_PASSWORD - Database password for the datadog user
Custom Image
Penn Labs uses a custom Datadog agent image, `pennlabs/datadog-agent`.

Common Use Cases
Debugging Application Errors
- Find the error in Datadog:
  - Check recent logs for context (before/after error)
- Look for patterns:
  - Is it happening repeatedly?
  - Same user/endpoint?
  - Correlated with deployment?
- Check stack trace in log details
- Reproduce locally if possible
Investigating Slow Requests
- Search for slow requests:
  - Group by endpoint to find bottlenecks
- Check for:
  - Slow database queries
  - External API timeouts
  - Large data processing
- Correlate with metrics:
  - CPU/memory usage
  - Database load
  - Traffic patterns
Tracking User Activity
- Search by user ID
- Sort by timestamp to see activity timeline
- Filter by log level to focus on errors or info
- Export logs if needed for investigation
Monitoring Deployments
- Filter by recent time (last 15 minutes)
- Watch for errors during deployment:
  - Check for new error patterns not seen before
  - Compare error rate before/after deployment
- Roll back if needed (see Rollback Procedures)
Troubleshooting
Logs Not Appearing in Datadog
Symptom: Application logs are not visible in Datadog

Solutions:

- Check the Datadog agent is running
- Check agent logs for errors
- Verify the API key is correct
- Check container logs are being written to stdout/stderr
- Ensure `containerCollectAll: true` is set in the Datadog config
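The first few checks might look like this with kubectl (the label selector and pod name are illustrative; adjust to the actual deployment):

```shell
# Check the Datadog agent pods are running in the monitoring namespace
kubectl get pods -n monitoring

# Tail agent logs for collection errors (pod name is illustrative)
kubectl logs -n monitoring datadog-agent-abc12 | grep -i error

# Confirm the API key secret exists
kubectl get secret datadog -n monitoring
```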
Missing Log Context
Symptom: Logs lack expected fields (user ID, request ID, etc.)

Solutions:

- Verify the application is using structured logging
- Check the logging configuration in the application
- Ensure extra fields are passed to the logger
- Update the logging format to include context
High Log Volume
Symptom: Too many logs, making it hard to find relevant information

Solutions:

- Reduce debug logging in production
- Filter out health check logs
- Sample high-volume logs (log every Nth request)
- Use log level filtering
- Create indexes for important logs only
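Sampling every Nth request can be as simple as the following sketch (the counter, sample rate, and function name are illustrative):

```python
import itertools
import logging

logger = logging.getLogger("clubs")
_counter = itertools.count()
SAMPLE_EVERY = 100  # log 1 in 100 calls (tune to the traffic volume)

def log_sampled(message: str) -> bool:
    """Log only every Nth call; returns True when the message was logged."""
    if next(_counter) % SAMPLE_EVERY == 0:
        logger.info(message)
        return True
    return False
```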
Logs Cut Off
Symptom: Log messages are truncated

Solutions:

- Datadog has a max log size (256 KB by default)
- Split large logs into multiple entries
- Log summaries instead of full payloads
- Store large data elsewhere, log reference ID