You can't fix what you can't see. When your application runs in production, you need visibility into what's happening. This guide covers the fundamentals of monitoring and observability: what to track, how to track it, and how to respond.
The Three Pillars
Observability rests on three pillars: logs, metrics, and traces. Each serves a different purpose.
Logs: Discrete events with details. "User 123 failed login at 14:32:01 - invalid password."
Metrics: Numerical measurements over time. "Request latency p99 is 450ms."
Traces: Request paths through distributed systems. "This request hit service A, then B, then the database, taking 230ms total."
You need all three. Logs tell you what happened. Metrics tell you if something's wrong. Traces tell you where in the system it's wrong.
What to Monitor and Why
Not everything matters. Focus on what affects users and the business:
The Four Golden Signals (from Google SRE):
- Latency: How long requests take. Track percentiles (p50, p95, p99), not just averages.
- Traffic: Request volume. Helps you understand load and spot anomalies.
- Errors: Failed requests (5xx errors, exceptions, business logic failures).
- Saturation: How full your resources are. CPU, memory, disk, connection pools.
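To make the four signals concrete, here is a minimal sketch using the Python prometheus_client library. The metric names and the handle_request wrapper are illustrative assumptions, not a prescribed setup.

    # Sketch of the four golden signals with prometheus_client.
    # Metric names and the handle_request wrapper are illustrative.
    import time
    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    REQUESTS = Counter("http_requests_total", "Traffic: total requests", ["endpoint"])
    ERRORS = Counter("http_errors_total", "Errors: failed requests", ["endpoint"])
    LATENCY = Histogram("http_request_seconds", "Latency: request duration in seconds", ["endpoint"])
    IN_FLIGHT = Gauge("http_in_flight_requests", "Saturation: requests currently being handled")

    def handle_request(endpoint, handler):
        """Wrap a request handler so every call records all four signals."""
        REQUESTS.labels(endpoint).inc()
        IN_FLIGHT.inc()
        start = time.time()
        try:
            return handler()
        except Exception:
            ERRORS.labels(endpoint).inc()
            raise
        finally:
            LATENCY.labels(endpoint).observe(time.time() - start)
            IN_FLIGHT.dec()

    start_http_server(8000)  # exposes /metrics for Prometheus to scrape

Percentiles like p95 and p99 are then computed from the histogram buckets at query time (for example with Prometheus's histogram_quantile), rather than stored as separate metrics.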
Business metrics often matter more than technical ones:
- Signups per hour
- Orders completed
- Revenue processed
- Active users
When these drop, something is wrong, even if your technical metrics look fine.
Structured Logging Best Practices
Unstructured logs are hard to analyze:
User login failed for john@example.com - bad password
Structured logs enable filtering and aggregation:
{
"timestamp": "2025-01-15T14:32:01Z",
"level": "warning",
"event": "login_failed",
"user_email": "john@example.com",
"reason": "invalid_password",
"ip": "192.168.1.1"
}
Logging best practices:
Use log levels appropriately:
- ERROR: Something failed that shouldn't have
- WARN: Something unexpected but handled
- INFO: Normal operations worth noting
- DEBUG: Detailed information for troubleshooting
Include context: Request ID, user ID, trace ID. Without context, logs are hard to correlate.
Don't log sensitive data: Passwords, tokens, personal information. Scrub or mask if needed.
Log at service boundaries: Incoming requests, outgoing calls, message consumption. This creates a clear trail.
Make logs searchable: Consistent field names across services let you query effectively.
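A minimal sketch of what this looks like in practice, using only Python's standard logging module; the field names, the JsonFormatter class, and the request_id handling are assumptions for illustration.

    # Structured JSON logging with request context, standard library only.
    # Field names and the request_id mechanism are illustrative.
    import json
    import logging
    import uuid

    class JsonFormatter(logging.Formatter):
        def format(self, record):
            payload = {
                "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
                "level": record.levelname.lower(),
                "event": record.getMessage(),
                "request_id": getattr(record, "request_id", None),
            }
            payload.update(getattr(record, "fields", {}))  # extra structured fields
            return json.dumps(payload)

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    # At the service boundary: generate (or read from headers) a request ID.
    request_id = str(uuid.uuid4())
    logger.warning("login_failed", extra={
        "request_id": request_id,
        "fields": {"user_email": "john@example.com", "reason": "invalid_password"},
    })

Because every line carries the same request_id, you can pull together the full story of one request across log statements, and, with a shared ID, across services.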
Key Metrics for Web Applications
Track these at minimum:
Request metrics:
- Request count by endpoint
- Response time percentiles (p50, p95, p99)
- Error rate by type
- Request size and response size
System metrics:
- CPU utilization
- Memory usage
- Disk I/O and space
- Network traffic
Application metrics:
- Active connections/sessions
- Queue depth (background jobs)
- Cache hit rate
- Database connection pool utilization
Custom business metrics:
- Payments processed
- User actions completed
- Feature usage
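Application and business metrics follow the same pattern as the request metrics above. Here is a short sketch, again with prometheus_client; the metric names and helper functions are made up for illustration.

    # Sketch of application- and business-level metrics.
    # Metric names and helpers are illustrative assumptions.
    from prometheus_client import Counter, Gauge

    QUEUE_DEPTH = Gauge("background_jobs_queued", "Jobs waiting in the queue")
    CACHE_LOOKUPS = Counter("cache_lookups_total", "Cache lookups by outcome", ["outcome"])
    PAYMENTS = Counter("payments_processed_total", "Payments processed", ["status"])

    def record_cache_lookup(hit: bool):
        CACHE_LOOKUPS.labels("hit" if hit else "miss").inc()

    def record_payment(succeeded: bool):
        PAYMENTS.labels("success" if succeeded else "failure").inc()

    def set_queue_depth(depth: int):
        QUEUE_DEPTH.set(depth)  # update periodically or on enqueue/dequeue

Cache hit rate is then a ratio of two counters computed in the dashboard (hits over total lookups), which stays accurate across restarts and aggregation.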
Distributed Tracing Basics
In distributed systems, a single user request might touch many services. Traces show the complete picture.
A trace is a tree of spans. Each span represents an operation (HTTP call, database query, function execution) with:
- Start time and duration
- Service and operation name
- Tags (user ID, HTTP status)
- Parent span (what called this)
Trace: abc123
├── GET /api/orders (200ms)
│ ├── auth-service: validateToken (15ms)
│ ├── orders-db: findOrders (120ms)
│ └── pricing-service: calculateTotals (40ms)
│ └── redis: cache-get (2ms)
With tracing, you can see that the orders database is the bottleneck; without it, you'd be guessing.
Implementing tracing:
- Generate a trace ID at the edge
- Pass trace context through all service calls
- Each service creates spans for its operations
- Send spans to a collector (Jaeger, Zipkin, cloud provider)
OpenTelemetry is the standard for instrumentation. It supports all major languages and integrates with most backends.
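Here is a minimal OpenTelemetry sketch in Python. The service and span names mirror the example trace above and are illustrative; a real setup would export spans to a collector (for example over OTLP to Jaeger) instead of the console, and use instrumentation libraries to propagate trace context on outgoing calls.

    # Minimal OpenTelemetry tracing sketch (Python SDK).
    # Span names mirror the example trace above and are illustrative.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("orders-service")

    def get_orders(user_id: str):
        # The outermost span starts the trace; nested spans become its children.
        with tracer.start_as_current_span("GET /api/orders") as span:
            span.set_attribute("user.id", user_id)
            with tracer.start_as_current_span("orders-db: findOrders"):
                pass  # run the database query here
            with tracer.start_as_current_span("pricing-service: calculateTotals"):
                pass  # downstream call; instrumentation injects trace headers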
Alerting Without Alert Fatigue
Bad alerting is worse than no alerting. Teams ignore alerts when too many are noise.
Alert on symptoms, not causes: Alert when users are affected (high error rate, slow responses), not on every CPU spike.
Set meaningful thresholds: "CPU over 80%" might be normal. "Error rate over 1%" is always worth investigating.
Include runbooks: Each alert should link to documentation on what to check and how to respond.
Use multiple severity levels:
- Critical: Page someone now (site is down)
- Warning: Investigate soon (degraded performance)
- Info: Awareness only (scheduled maintenance)
Reduce noise:
- Use hysteresis (must be over threshold for 5 minutes; see the sketch after this list)
- Group related alerts
- Auto-resolve when conditions clear
- Review and tune regularly
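To illustrate the hysteresis point, here is a small sketch of the logic in Python; the 1% threshold and five-minute hold are assumptions. In practice a monitoring system expresses this declaratively (for example, Prometheus alert rules take a "for" duration).

    # Hysteresis sketch: fire only after the condition has held continuously
    # for a sustained window. Threshold and window values are illustrative.
    import time
    from typing import Optional

    class HysteresisAlert:
        def __init__(self, threshold: float, hold_seconds: float = 300.0):
            self.threshold = threshold
            self.hold_seconds = hold_seconds
            self._breach_started: Optional[float] = None

        def check(self, value: float, now: Optional[float] = None) -> bool:
            """Return True only if value has exceeded the threshold for the full window."""
            now = time.time() if now is None else now
            if value <= self.threshold:
                self._breach_started = None  # condition cleared; auto-resolves
                return False
            if self._breach_started is None:
                self._breach_started = now
            return now - self._breach_started >= self.hold_seconds

    error_rate_alert = HysteresisAlert(threshold=0.01)  # fire if >1% errors for 5 minutes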
On-call reality check: If you're paged more than once a week on average, you have too many alerts or too many problems.
Tool Options
Metrics and monitoring:
- Datadog: Full-featured, expensive, great UX
- Grafana + Prometheus: Open source, self-hosted or cloud
- New Relic: APM focus, good for application performance
- CloudWatch: AWS native, good if you're all-in on AWS
Logging:
- Datadog Logs
- Elastic Stack (ELK)
- Splunk: Enterprise grade
- CloudWatch Logs
Tracing:
- Jaeger: Open source
- Zipkin: Open source
- Datadog APM
- AWS X-Ray
For small teams, pick one integrated platform (Datadog, New Relic, or cloud-native tools). Managing separate systems adds overhead.
Starting Simple and Scaling Up
You don't need everything on day one. Start with:
Level 1: Basic visibility
- Application logs to a central location
- Key metrics dashboards (request rate, errors, latency)
- Basic alerts for critical issues
Level 2: Improved diagnostics
- Structured logging with request correlation
- APM with transaction traces
- Better alerting with runbooks
Level 3: Full observability
- Distributed tracing across services
- Custom business metrics
- SLOs and error budgets
- Automated anomaly detection
Move through levels as your system and team grow. A startup doesn't need what a large enterprise needs.
Practical Steps
- Add structured logging if you haven't already. Include request IDs.
- Set up basic dashboards showing request rate, error rate, and latency.
- Create alerts for critical conditions: site down, error rate spike, key business metrics drop.
- Add APM/tracing when debugging performance issues becomes painful.
- Review and iterate monthly. Remove noisy alerts. Add metrics you wish you had during incidents.
Conclusion
Observability isn't about collecting data; it's about understanding your system. When something breaks at 3 AM, you want to know immediately, see exactly what's wrong, and find the cause quickly.
Start with the basics: logs, key metrics, and alerts that matter. Expand as you need more visibility. The goal is answering "what's wrong and where" in minutes, not hours.
Invest in observability before you need it. The cost of building it is much lower than the cost of debugging blind.