The Alerting Problem
Bad alerting is worse than no alerting. When engineers receive dozens of low-signal alerts per week, they develop alert fatigue: they start ignoring pages, delays in acknowledgment become normal, and when a real incident occurs, the alert is buried in noise.
The goal of alerting is simple: wake someone up exactly when something real is wrong, and never wake them up for anything else. Getting there requires discipline about what you alert on and how those alerts are configured.
Alert on Symptoms, Not Causes
The most important principle in alerting: alert on user-visible symptoms, not on internal system states.
Cause-based alerts (fragile, noisy):
- CPU usage above 80%
- Database connections above 90 of 100
- Queue depth above 1000 jobs
- Memory usage above 85%
Symptom-based alerts (actionable, reliable):
- Error rate above 1% for the past 5 minutes
- 99th percentile response time above 2 seconds
- Successful payment rate below 95%
- Users unable to log in
The difference: symptom-based alerts fire when users are experiencing something bad. Cause-based alerts fire when something internal looks unusual—even if users aren't affected at all. High CPU with normal response times and error rates isn't an incident; it's a capacity observation.
This is why the Four Golden Signals (covered in a separate article) form the foundation of good alerting: error rate, latency, traffic, and saturation are all user-observable.
The Anatomy of a Good Alert
Every alert should answer these questions before it's created:
- What is broken? Specific, not vague ("invoices are failing to generate" not "something is wrong")
- Who is affected? All users? Premium users? A specific region?
- How do I investigate? Link directly to the relevant dashboard or runbook
- What action should I take? Clear first steps for the on-call engineer
- Can this wait until morning? If yes, don't page; send a Slack message instead
Here's a well-structured alert definition in Prometheus alerting rules:
```yaml
groups:
  - name: billing
    rules:
      - alert: PaymentSuccessRateLow
        expr: |
          sum(rate(payment_completed_total{status="success"}[5m]))
            /
          sum(rate(payment_completed_total[5m]))
          < 0.90
        for: 3m
        labels:
          severity: critical
          team: billing
        annotations:
          summary: "Payment success rate below 90%"
          description: |
            Payment success rate is {{ humanizePercentage $value }} over the last 5 minutes.
            This directly impacts revenue. Investigate immediately.
          runbook_url: "https://runbooks.internal/billing/payment-failures"
          dashboard_url: "https://grafana.internal/d/billing-overview"
```
The `for: 3m` clause is critical: the condition must persist for a full 3 minutes before the alert fires, which suppresses pages caused by transient spikes.
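To make the `for:` semantics concrete, here is a minimal Python sketch (not Prometheus code, just a model of the behavior): the condition must hold continuously for the required duration, and any single healthy evaluation resets the clock.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ForDuration:
    """Models the Prometheus 'for:' clause: an alert fires only after its
    condition has held continuously for `required_secs` seconds."""
    required_secs: int
    pending_since: Optional[float] = None

    def evaluate(self, now: float, condition: bool) -> bool:
        if not condition:
            self.pending_since = None  # any healthy sample resets the clock
            return False
        if self.pending_since is None:
            self.pending_since = now   # condition just became true: start pending
        return now - self.pending_since >= self.required_secs

# A 30-second spike never fires a 3-minute (180 s) alert...
alert = ForDuration(required_secs=180)
spike = [alert.evaluate(t, condition=(t < 30)) for t in range(0, 300, 15)]
assert not any(spike)

# ...but a sustained breach does.
alert = ForDuration(required_secs=180)
sustained = [alert.evaluate(t, condition=True) for t in range(0, 300, 15)]
assert sustained[-1]
```

The trade-off is detection delay: a real outage also waits out the `for:` window before paging, so keep the duration short for critical symptom alerts.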
Severity Levels That Mean Something
Define severity levels that drive consistent behavior:
P1 - Critical: Service is down or severely degraded. Users cannot complete core workflows. Page immediately, regardless of time of day. Expected response: within 5 minutes.
P2 - High: Significant degradation affecting a subset of users or a non-critical workflow. Page during business hours; wake up on-call after hours only if trending toward P1. Expected response: within 30 minutes.
P3 - Medium: Degradation present but manageable. No paging; create a ticket and address during business hours. Expected response: within 24 hours.
P4 - Low: Informational. Something worth knowing about but not urgent. Slack notification or weekly review. Expected response: next sprint.
Map your alerts to these levels explicitly. An alert with no severity level defaults to P1 in most people's instinct, which is how alert fatigue starts.
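One way to make that mapping explicit is to encode it in the paging logic itself. This is a hypothetical sketch (the policy table and `should_page` helper are illustrative, not from any particular tool) that mirrors the P1-P4 definitions above and refuses to route an alert with no declared severity:

```python
from datetime import timedelta

# Illustrative policy table mirroring the P1-P4 definitions above.
SEVERITY_POLICY = {
    "P1": {"page": "always",         "respond_within": timedelta(minutes=5)},
    "P2": {"page": "business_hours", "respond_within": timedelta(minutes=30)},
    "P3": {"page": "never",          "respond_within": timedelta(hours=24)},
    "P4": {"page": "never",          "respond_within": None},  # weekly review
}

def should_page(severity: str, during_business_hours: bool) -> bool:
    # Reject unlabeled alerts instead of silently treating them as P1 --
    # implicit escalation to "page everyone" is how alert fatigue starts.
    if severity not in SEVERITY_POLICY:
        raise ValueError(f"alert has no recognized severity: {severity!r}")
    page = SEVERITY_POLICY[severity]["page"]
    return page == "always" or (page == "business_hours" and during_business_hours)

assert should_page("P1", during_business_hours=False)   # P1 pages at 3am
assert not should_page("P2", during_business_hours=False)
assert should_page("P2", during_business_hours=True)
assert not should_page("P3", during_business_hours=True)
```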
Setting Thresholds Statistically
Arbitrary thresholds ("alert if error rate > 1%") lead to both false positives and false negatives. Better thresholds are based on historical baselines and statistical deviation.
Percentile-based thresholds: Instead of "alert if P99 latency > 500ms", alert when P99 latency is more than 3x its value at the same time of day one week earlier:
```yaml
- alert: LatencyAnomalous
  expr: |
    (
      histogram_quantile(0.99, rate(http_request_duration_bucket[5m]))
        /
      histogram_quantile(0.99, rate(http_request_duration_bucket[5m] offset 7d))
    ) > 3
  for: 5m
```
This alert is resilient to normal variation between weekdays and weekends, and between peak and off-peak hours.
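The core comparison is simple enough to sketch in a few lines of Python (the function and sample numbers are illustrative, not part of any real monitoring API):

```python
def latency_anomalous(current_p99_ms: float, week_ago_p99_ms: float,
                      ratio_threshold: float = 3.0) -> bool:
    """Mirrors the PromQL rule above: flag the current P99 latency only if it
    exceeds the value at the same time of day one week earlier by 3x."""
    return current_p99_ms / week_ago_p99_ms > ratio_threshold

# Monday-morning peak: 450 ms now vs 400 ms last Monday -> normal variation.
assert not latency_anomalous(450.0, 400.0)

# Quiet Sunday night: 300 ms now vs 80 ms last Sunday -> anomalous,
# even though 300 ms would pass a naive "alert above 500 ms" threshold.
assert latency_anomalous(300.0, 80.0)
```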
Burn rate alerts: Instead of point-in-time threshold alerts, burn rate alerts calculate how quickly you're consuming your error budget. A high burn rate means you're on track to exhaust your SLO before the window ends:
```yaml
# Fire when the last hour's error rate is burning the 30-day error budget
# at 14.4x the sustainable rate, i.e. roughly 2% of the monthly budget per hour
- alert: ErrorBudgetBurnRateCritical
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1h]))
      /
    sum(rate(http_requests_total[1h]))
    > (1 - 0.999) * 14.4
```
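The arithmetic behind the 14.4 multiplier is worth spelling out. At a burn rate of 1.0 you spend your budget exactly over the SLO window; at burn rate B you exhaust it in window/B. A quick sketch:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many multiples of the budgeted error rate we are currently burning.
    With a 99.9% SLO the budget is a 0.1% error rate, so burn rate 1.0
    spends the budget exactly over the full SLO window."""
    return error_rate / (1 - slo)

def hours_to_exhaust_budget(rate_multiple: float, window_days: int = 30) -> float:
    """At burn rate B, a budget sized for `window_days` lasts window/B."""
    return window_days * 24 / rate_multiple

# A 14.4x burn consumes a 30-day budget in 50 hours (~2 days) -- about 2%
# of the monthly budget per hour, a common fast-burn paging threshold.
assert round(hours_to_exhaust_budget(14.4), 6) == 50.0

# With a 99.9% SLO, a 1.44% observed error rate is a 14.4x burn.
assert round(burn_rate(0.0144, 0.999), 1) == 14.4
```

This is why burn-rate alerts page on trajectory rather than on a snapshot: a 1.44% error rate is tolerable for minutes but catastrophic if sustained for two days.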
Grouping and Routing
Not all alerts should go to the same place. Route alerts to the right team with the right urgency:
```yaml
# Alertmanager routing configuration
route:
  receiver: default
  group_by: [alertname, cluster]
  group_wait: 30s      # Wait to collect related alerts into one notification
  group_interval: 5m   # Minimum wait before notifying about new alerts in an existing group
  repeat_interval: 4h  # Re-send a still-firing alert's notification every 4 hours
  routes:
    - matchers:
        - team = billing
        - severity = critical
      receiver: billing-pagerduty
    - matchers:
        - team = billing
        - severity = high
      receiver: billing-slack
    - matchers:
        - team = infrastructure
        - severity =~ "critical|high"
      receiver: infra-pagerduty

receivers:
  - name: billing-pagerduty
    pagerduty_configs:
      - routing_key: ${PAGERDUTY_BILLING_KEY}
        description: '{{ .GroupLabels.alertname }}'
  - name: billing-slack
    slack_configs:
      - channel: '#billing-alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```
Grouping is essential: if a deployment triggers 20 alerts simultaneously, Alertmanager should send one notification about a group of related alerts, not 20 individual pages.
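The grouping behavior reduces to bucketing alerts by their `group_by` labels. A minimal Python model (the alert payloads are hypothetical, not real Alertmanager objects):

```python
from collections import defaultdict

# Hypothetical firing alerts after a bad deployment: 20 pods with high
# error rates, plus one downstream billing alert, all in the same cluster.
alerts = [
    {"alertname": "HighErrorRate", "cluster": "us-east", "pod": f"api-{i}"}
    for i in range(20)
] + [{"alertname": "PaymentSuccessRateLow", "cluster": "us-east", "pod": "billing-0"}]

# Group by (alertname, cluster), as in the group_by setting above.
groups = defaultdict(list)
for alert in alerts:
    groups[(alert["alertname"], alert["cluster"])].append(alert)

# 21 firing alerts collapse into 2 notifications.
assert len(alerts) == 21
assert len(groups) == 2
assert len(groups[("HighErrorRate", "us-east")]) == 20
```

Choosing the `group_by` keys is the design decision: group too coarsely and unrelated incidents share one notification; group too finely and you are back to a page per pod.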
Runbooks: The Most Underrated Alert Component
Every P1 and P2 alert must have a runbook linked in its annotations. A runbook answers:
- How do I verify the alert is real and not a false positive?
- What are the most common causes of this alert?
- What are the immediate mitigation steps?
- Who should I escalate to if the issue persists?
A basic runbook format:
```markdown
# Payment Success Rate Low

## Alert Condition
Payment success rate below 90% for more than 3 minutes.

## Verification
1. Check the Billing Overview dashboard
2. Look at recent Stripe API error logs: `CloudWatch > /app/billing > filter ERROR`
3. Check Stripe status page: https://status.stripe.com

## Common Causes
- Stripe API outage (check status page)
- Invalid API key after rotation (check Secrets Manager)
- Webhook endpoint certificate expired
- Surge of declined cards (check fraud patterns)

## Immediate Mitigation
- If Stripe outage: enable fallback payment capture mode
- If API key issue: rotate via `./scripts/rotate-stripe-key.sh`
- If webhook issue: check certificate expiry in ACM

## Escalation
- After 15 minutes without resolution: page @billing-lead
- After 30 minutes: page @cto
```
Runbooks turn 2am incidents from "confused engineer fumbling in the dark" into "engineer following a clear checklist."
Measuring Alert Quality
Track these metrics to know if your alerting is improving:
- Alert-to-incident ratio: What percentage of alerts correspond to real incidents? Under 80% means too many false positives.
- Mean time to acknowledge (MTTA): Long MTTA suggests alert fatigue or unclear ownership.
- Alert volume per week: This should be trending down over time as false positives are eliminated.
- False positive rate by alert: Identify your noisiest alerts and fix or remove them.
Review these metrics in a monthly on-call retrospective.
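The first two metrics are straightforward to compute from an alert log. A small sketch, assuming a hypothetical log format with fired/acknowledged timestamps and a post-incident "was this real?" flag:

```python
from statistics import mean

# Hypothetical alert log: epoch-second timestamps plus a flag recorded
# during incident review indicating whether the alert was a real incident.
alert_log = [
    {"fired": 1000, "acked": 1120, "real_incident": True},
    {"fired": 5000, "acked": 5060, "real_incident": True},
    {"fired": 9000, "acked": 9900, "real_incident": False},  # false positive
]

alert_to_incident = sum(a["real_incident"] for a in alert_log) / len(alert_log)
mtta_seconds = mean(a["acked"] - a["fired"] for a in alert_log)

assert round(alert_to_incident, 2) == 0.67  # below the 80% target: too noisy
assert mtta_seconds == 360.0                # 6 minutes mean time to acknowledge
```

Sliced per alert name, the same calculation surfaces the noisiest rules to fix or delete first.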
Practical Takeaways
- Alert on user-visible symptoms (error rate, latency) not internal metrics (CPU, queue depth)
- Use `for:` durations in alert rules to eliminate transient spikes
- Define explicit severity levels and consistent response expectations for each
- Group related alerts so one deployment doesn't generate 20 pages
- Write runbooks for every P1/P2 alert and link them in alert annotations
- Track alert quality metrics monthly and eliminate your noisiest false positives
Need help building reliable systems? We help teams architect software that scales. scopeforged.com