You can't fix what you can't see. When your application runs in production, you need visibility into what's happening. This guide covers the fundamentals of monitoring and observability: what to track, how to track it, and how to respond.
The Three Pillars
Observability rests on three pillars: logs, metrics, and traces. Each serves a different purpose.
Logs: Discrete events with details. "User 123 failed login at 14:32:01 - invalid password."
Metrics: Numerical measurements over time. "Request latency p99 is 450ms."
Traces: Request paths through distributed systems. "This request hit service A, then B, then the database, taking 230ms total."
You need all three. Logs tell you what happened. Metrics tell you if something's wrong. Traces tell you where in the system it's wrong.
What to Monitor and Why
Not everything matters. Focus on what affects users and the business:
The Four Golden Signals (from Google SRE):
- Latency: How long requests take. Track percentiles (p50, p95, p99), not just averages.
- Traffic: Request volume. Helps you understand load and spot anomalies.
- Errors: Failed requests (5xx errors, exceptions, business logic failures).
- Saturation: How full your resources are. CPU, memory, disk, connection pools.
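To make the four signals concrete, here is a minimal sketch using the Python prometheus_client library. The metric names and the handle_request wrapper are illustrative assumptions, not a prescribed setup.

    # Sketch of the four golden signals with prometheus_client.
    # Metric names and the handle_request wrapper are illustrative.
    import time
    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    REQUESTS = Counter("http_requests_total", "Traffic: total requests", ["endpoint"])
    ERRORS = Counter("http_errors_total", "Errors: failed requests", ["endpoint"])
    LATENCY = Histogram("http_request_seconds", "Latency: request duration in seconds", ["endpoint"])
    IN_FLIGHT = Gauge("http_in_flight_requests", "Saturation: requests currently being handled")

    def handle_request(endpoint, handler):
        """Wrap a request handler so every call records all four signals."""
        REQUESTS.labels(endpoint).inc()
        IN_FLIGHT.inc()
        start = time.time()
        try:
            return handler()
        except Exception:
            ERRORS.labels(endpoint).inc()
            raise
        finally:
            LATENCY.labels(endpoint).observe(time.time() - start)
            IN_FLIGHT.dec()

    start_http_server(8000)  # exposes /metrics for Prometheus to scrape

Percentiles like p95 and p99 are then computed from the histogram buckets at query time (for example with Prometheus's histogram_quantile), rather than stored as separate metrics.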
Business metrics often matter more than technical ones:
- Signups per hour
- Orders completed
- Revenue processed
- Active users
When these drop, something is wrong, even if your technical metrics look fine.
Structured Logging Best Practices
Unstructured logs are hard to analyze:
User login failed for john@example.com - bad password
Structured logs enable filtering and aggregation:
{
"timestamp": "2025-01-15T14:32:01Z",
"level": "warning",
"event": "login_failed",
"user_email": "john@example.com",
"reason": "invalid_password",
"ip": "192.168.1.1"
}
Logging best practices:
Use log levels appropriately:
- ERROR: Something failed that shouldn't have
- WARN: Something unexpected but handled
- INFO: Normal operations worth noting
- DEBUG: Detailed information for troubleshooting
Include context: Request ID, user ID, trace ID. Without context, logs are hard to correlate.
Don't log sensitive data: Passwords, tokens, personal information. Scrub or mask if needed.
Log at service boundaries: Incoming requests, outgoing calls, message consumption. This creates a clear trail.
Make logs searchable: Consistent field names across services let you query effectively.
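A minimal sketch of what this looks like in practice, using only Python's standard logging module; the field names, the JsonFormatter class, and the request_id handling are assumptions for illustration.

    # Structured JSON logging with request context, standard library only.
    # Field names and the request_id mechanism are illustrative.
    import json
    import logging
    import uuid

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            payload = {
                "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
                "level": record.levelname.lower(),
                "event": record.getMessage(),
                "request_id": getattr(record, "request_id", None),
            }
            payload.update(getattr(record, "fields", {}))  # extra structured fields
            return json.dumps(payload)

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    # At the service boundary: generate (or read from headers) a request ID.
    request_id = str(uuid.uuid4())
    logger.warning("login_failed", extra={
        "request_id": request_id,
        "fields": {"user_email": "john@example.com", "reason": "invalid_password"},
    })

Because every line carries the same request_id, you can pull together the full story of one request across log statements, and, with a shared ID, across services.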
Key Metrics for Web Applications
Track these at minimum:
Request metrics:
- Request count by endpoint
- Response time percentiles (p50, p95, p99)
- Error rate by type
- Request size and response size
System metrics:
- CPU utilization
- Memory usage
- Disk I/O and space
- Network traffic
Application metrics:
- Active connections/sessions
- Queue depth (background jobs)
- Cache hit rate
- Database connection pool utilization
Custom business metrics:
- Payments processed
- User actions completed
- Feature usage
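Application and business metrics follow the same pattern as the request metrics above. Here is a short sketch, again with prometheus_client; the metric names and helper functions are made up for illustration.

    # Sketch of application- and business-level metrics.
    # Metric names and helpers are illustrative assumptions.
    from prometheus_client import Counter, Gauge

    QUEUE_DEPTH = Gauge("background_jobs_queued", "Jobs waiting in the queue")
    CACHE_LOOKUPS = Counter("cache_lookups_total", "Cache lookups by outcome", ["outcome"])
    PAYMENTS = Counter("payments_processed_total", "Payments processed", ["status"])

    def record_cache_lookup(hit: bool):
        CACHE_LOOKUPS.labels("hit" if hit else "miss").inc()

    def record_payment(succeeded: bool):
        PAYMENTS.labels("success" if succeeded else "failure").inc()

    def set_queue_depth(depth: int):
        QUEUE_DEPTH.set(depth)  # update periodically or on enqueue/dequeue

Cache hit rate is then a ratio of two counters computed in the dashboard (hits over total lookups), which stays accurate across restarts and aggregation.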
Distributed Tracing Basics
In distributed systems, a single user request might touch many services. Traces show the complete picture.
A trace is a tree of spans. Each span represents an operation (HTTP call, database query, function execution) with:
- Start time and duration
- Service and operation name
- Tags (user ID, HTTP status)
- Parent span (what called this)
Trace: abc123
├── GET /api/orders (200ms)
│ ├── auth-service: validateToken (15ms)
│ ├── orders-db: findOrders (120ms)
│ └── pricing-service: calculateTotals (40ms)
│ └── redis: cache-get (2ms)
With tracing, you can see that the orders database is the bottleneck; without it, you'd be guessing.
Implementing tracing:
- Generate a trace ID at the edge
- Pass trace context through all service calls
- Each service creates spans for its operations
- Send spans to a collector (Jaeger, Zipkin, cloud provider)
OpenTelemetry is the standard for instrumentation. It supports all major languages and integrates with most backends.
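Here is a minimal OpenTelemetry sketch in Python. The service and span names mirror the example trace above and are illustrative; a real setup would export spans to a collector (for example over OTLP to Jaeger) instead of the console, and use instrumentation libraries to propagate trace context on outgoing calls.

    # Minimal OpenTelemetry tracing sketch (Python SDK).
    # Span names mirror the example trace above and are illustrative.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("orders-service")

    def get_orders(user_id: str):
        # The outermost span starts the trace; nested spans become its children.
        with tracer.start_as_current_span("GET /api/orders") as span:
            span.set_attribute("user.id", user_id)
            with tracer.start_as_current_span("orders-db: findOrders"):
                pass  # run the database query here
            with tracer.start_as_current_span("pricing-service: calculateTotals"):
                pass  # downstream call; instrumentation injects trace headers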
Alerting Without Alert Fatigue
Bad alerting is worse than no alerting. Teams ignore alerts when too many are noise.
Alert on symptoms, not causes: Alert when users are affected (high error rate, slow responses), not on every CPU spike.
Set meaningful thresholds: "CPU over 80%" might be normal. "Error rate over 1%" is always worth investigating.
Include runbooks: Each alert should link to documentation on what to check and how to respond.
Use multiple severity levels:
- Critical: Page someone now (site is down)
- Warning: Investigate soon (degraded performance)
- Info: Awareness only (scheduled maintenance)
Reduce noise:
- Use hysteresis (must be over threshold for 5 minutes; see the sketch after this list)
- Group related alerts
- Auto-resolve when conditions clear
- Review and tune regularly
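To illustrate the hysteresis point, here is a small sketch of the logic in Python; the 1% threshold and five-minute hold are assumptions. In practice a monitoring system expresses this declaratively (for example, Prometheus alert rules take a "for" duration).

    # Hysteresis sketch: fire only after the condition has held continuously
    # for a sustained window. Threshold and window values are illustrative.
    import time
    from typing import Optional

    class HysteresisAlert:
        def __init__(self, threshold: float, hold_seconds: float = 300.0):
            self.threshold = threshold
            self.hold_seconds = hold_seconds
            self._breach_started: Optional[float] = None

        def check(self, value: float, now: Optional[float] = None) -> bool:
            """Return True only if value has exceeded the threshold for the full window."""
            now = time.time() if now is None else now
            if value <= self.threshold:
                self._breach_started = None  # condition cleared; auto-resolves
                return False
            if self._breach_started is None:
                self._breach_started = now
            return now - self._breach_started >= self.hold_seconds

    error_rate_alert = HysteresisAlert(threshold=0.01)  # fire if >1% errors for 5 minutes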
On-call reality check: If you're paged more than once a week on average, you have too many alerts or too many problems.
Tool Options
Metrics and monitoring:
- Datadog: Full-featured, expensive, great UX
- Grafana + Prometheus: Open source, self-hosted or cloud
- New Relic: APM focus, good for application performance
- CloudWatch: AWS native, good if you're all-in on AWS
Logging:
- Datadog Logs
- Elastic Stack (ELK)
- Splunk: Enterprise grade
- CloudWatch Logs
Tracing:
- Jaeger: Open source
- Zipkin: Open source
- Datadog APM
- AWS X-Ray
For small teams, pick one integrated platform (Datadog, New Relic, or cloud-native tools). Managing separate systems adds overhead.
Starting Simple and Scaling Up
You don't need everything on day one. Start with:
Level 1: Basic visibility
- Application logs to a central location
- Key metrics dashboards (request rate, errors, latency)
- Basic alerts for critical issues
Level 2: Improved diagnostics
- Structured logging with request correlation
- APM with transaction traces
- Better alerting with runbooks
Level 3: Full observability
- Distributed tracing across services
- Custom business metrics
- SLOs and error budgets
- Automated anomaly detection
Move through levels as your system and team grow. A startup doesn't need what a large enterprise needs.
Practical Steps
- Add structured logging if you haven't already. Include request IDs.
- Set up basic dashboards showing request rate, error rate, and latency.
- Create alerts for critical conditions: site down, error rate spike, key business metrics drop.
- Add APM/tracing when debugging performance issues becomes painful.
- Review and iterate monthly. Remove noisy alerts. Add metrics you wish you had during incidents.
Conclusion
Observability isn't about collecting data; it's about understanding your system. When something breaks at 3 AM, you want to know immediately, see exactly what's wrong, and find the cause quickly.
Start with the basics: logs, key metrics, and alerts that matter. Expand as you need more visibility. The goal is answering "what's wrong and where" in minutes, not hours.
Invest in observability before you need it. The cost of building it is much lower than the cost of debugging blind.