Service Level Objectives (SLOs) That Work

Reverend Philip · Dec 26, 2025

Define meaningful SLOs for your services. Learn to measure reliability, set error budgets, and make data-driven decisions.

Service Level Objectives (SLOs) define reliability targets that guide engineering decisions. Unlike vague goals like "high availability," SLOs provide measurable targets that balance reliability investment with feature development.

Understanding SLOs

The Hierarchy

SLOs exist within a hierarchy of reliability concepts. Understanding how they relate helps you implement them effectively.

SLA (Service Level Agreement)
↓ External contract with customers
↓ "99.9% uptime or credits issued"

SLO (Service Level Objective)
↓ Internal reliability target
↓ "99.95% successful requests"

SLI (Service Level Indicator)
↓ The actual measurement
↓ "Successful requests / Total requests"

Why SLOs Matter

Without SLOs:

  • "Is our service reliable enough?" → No answer
  • "Should we fix this reliability issue or build features?" → Endless debate

With SLOs:

  • "Are we meeting our 99.9% target?" → Check the dashboard
  • "We have error budget remaining" → Ship features
  • "Error budget exhausted" → Focus on reliability

Defining Good SLIs

Availability

Availability measures whether your service is responding successfully to requests. Be careful about what you count as "failed": client errors (4xx) are typically not failures from the service's perspective.

Availability = Successful requests / Total requests

Successful = HTTP 2xx, 3xx, 4xx (client errors are "successful" handling)
Failed = HTTP 5xx, timeouts, connection errors

Here is how you might implement availability tracking in your application. The key is measuring from the user's perspective, not just server health.

// Measure availability
class AvailabilityMetrics
{
    public function record(Response $response, float $duration): void
    {
        // Success = no server error and a response within 10 seconds
        $success = $response->status() < 500 && $duration < 10;

        Metrics::increment('requests.total');
        if ($success) {
            Metrics::increment('requests.successful');
        } else {
            // Feeds the error budget calculations later in this article
            Metrics::increment('requests.failed');
        }
    }

    public function getAvailability(string $period = '30d'): float
    {
        $total = Metrics::sum('requests.total', $period);
        $successful = Metrics::sum('requests.successful', $period);

        return $total > 0 ? ($successful / $total) * 100 : 100;
    }
}

Notice that the success check includes a duration threshold. A request that takes 10 seconds to return a 200 is not really successful from the user's perspective.

Latency

Latency SLIs measure how quickly your service responds. Percentiles are more meaningful than averages: a mean can look healthy even when the slowest requests are painfully slow, while a percentile target states directly what fraction of requests must meet the threshold.

Latency SLI = Requests below threshold / Total requests

Example: 95% of requests complete in < 200ms

This implementation tracks latency as a histogram, allowing you to calculate any percentile and measure SLO compliance.

class LatencyMetrics
{
    public function record(float $durationMs): void
    {
        Metrics::histogram('request.duration', $durationMs);
    }

    public function getPercentile(float $percentile, string $period = '30d'): float
    {
        return Metrics::percentile('request.duration', $percentile, $period);
    }

    public function getSloCompliance(float $threshold, string $period = '30d'): float
    {
        $total = Metrics::count('request.duration', $period);
        $fast = Metrics::countBelow('request.duration', $threshold, $period);

        return $total > 0 ? ($fast / $total) * 100 : 100;
    }
}

Correctness

Correctness measures whether your service returns the right answer, not just any answer. This is domain-specific and requires defining what "correct" means for your system.

Correctness = Correct responses / Total responses

Requires defining "correct" for your domain

For payment processing, correctness might mean the charge amount matches the request, the customer was properly billed, and the order was marked as paid.

// Example: Payment processing correctness
class PaymentCorrectnessMetrics
{
    public function record(Payment $payment, PaymentResult $result): void
    {
        Metrics::increment('payments.total');

        // Verify expected outcome
        $correct = $this->verifyCorrectness($payment, $result);

        if ($correct) {
            Metrics::increment('payments.correct');
        } else {
            Log::error('Payment correctness issue', [
                'payment_id' => $payment->id,
                'expected' => $payment->expected_state,
                'actual' => $result->state,
            ]);
        }
    }

    private function verifyCorrectness(Payment $payment, PaymentResult $result): bool
    {
        // Illustrative checks only; "correct" is domain-specific.
        // Here: final state matches and the charged amount (integer
        // cents) equals what was requested.
        return $result->state === $payment->expected_state
            && $result->chargedAmount === $payment->amount;
    }
}

Freshness

Freshness measures how quickly data changes propagate through your system. This is critical for systems with replication, caching, or eventual consistency.

Freshness = Data updates within threshold / Total updates

Example: 99% of data changes visible within 5 seconds
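
Here is one way freshness tracking might look, following the same pattern as the earlier metrics classes. The Metrics facade and the 5-second threshold are illustrative assumptions; the key is to timestamp each change at its source and record the delay once it becomes visible to readers.

class FreshnessMetrics
{
    public function record(float $propagationSeconds): void
    {
        Metrics::increment('updates.total');

        // Assumed threshold: matches the 5-second example above
        if ($propagationSeconds <= 5) {
            Metrics::increment('updates.fresh');
        }
    }

    public function getFreshness(string $period = '30d'): float
    {
        $total = Metrics::sum('updates.total', $period);
        $fresh = Metrics::sum('updates.fresh', $period);

        return $total > 0 ? ($fresh / $total) * 100 : 100;
    }
}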

Setting SLO Targets

Start with Current Performance

Do not guess at what your SLO should be. Measure your current performance and set a target slightly better than where you are today; a small sketch encoding this rule follows the steps below.

Step 1: Measure current state
- Availability: 99.7%
- P95 latency: 450ms

Step 2: Set achievable target (slightly better)
- Availability SLO: 99.8%
- Latency SLO: 95% < 500ms

Step 3: Iterate based on user needs and cost
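
To make the process concrete, here is a hypothetical helper that encodes the "slightly better than today" rule against a ladder of common targets. The rung values and the one-step policy are assumptions to adapt, not a standard.

class SloTargetSuggester
{
    private const RUNGS = [99.0, 99.5, 99.8, 99.9, 99.95, 99.99];

    public function suggest(float $measuredAvailability): float
    {
        // Below the ladder entirely: aim for the first rung
        if ($measuredAvailability < self::RUNGS[0]) {
            return self::RUNGS[0];
        }

        // Find the highest rung already met today...
        $current = self::RUNGS[0];
        foreach (self::RUNGS as $rung) {
            if ($measuredAvailability >= $rung) {
                $current = $rung;
            }
        }

        // ...then propose the next rung up as an achievable stretch,
        // never more than one step beyond demonstrated performance
        $index = array_search($current, self::RUNGS, true);

        return self::RUNGS[min($index + 1, count(self::RUNGS) - 1)];
    }
}

With the measured 99.7% from step 1, this returns 99.8%, matching the target in step 2.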

The Nines

Each additional "nine" of availability is exponentially harder and more expensive to achieve. Understand what your target actually means in terms of acceptable downtime.

Target      Downtime/year    Downtime/month
99%         3.65 days        7.3 hours
99.9%       8.76 hours       43.8 minutes
99.95%      4.38 hours       21.9 minutes
99.99%      52.6 minutes     4.38 minutes
99.999%     5.26 minutes     26.3 seconds

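The numbers in the table are plain arithmetic: allowed downtime is the fraction of time you may be unavailable multiplied by the window length.

Allowed downtime = (1 - Target) × Window length

Example: 99.9% over one year
0.001 × 8,766 hours ≈ 8.76 hours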

User-Centric Targets

Different services warrant different reliability targets based on user expectations and business impact.

Ask: "What do users actually need?"

API for web app:
- Availability: 99.9% (users retry)
- Latency: 95% < 500ms (acceptable UX)

Payment processing:
- Availability: 99.99% (money is critical)
- Correctness: 99.999% (wrong amounts are unacceptable)

Batch processing:
- Availability: 99% (retry tomorrow)
- Freshness: 95% within 1 hour

Error Budgets

Concept

The error budget is the complement of your SLO target. If your SLO is 99.9% availability, your error budget is 0.1% of requests that can fail without violating the SLO: over a 30-day window serving 10 million requests, that is 10,000 allowed failures.

Error Budget = 100% - SLO Target

SLO: 99.9% availability
Error Budget: 0.1% of requests can fail

Calculating Budget

This service calculates how much error budget remains based on actual failures versus allowed failures. Track this continuously to make informed decisions.

class ErrorBudgetService
{
    public function calculate(string $period = '30d'): array
    {
        $sloTarget = 99.9;
        $totalRequests = Metrics::sum('requests.total', $period);
        $failedRequests = Metrics::sum('requests.failed', $period);

        $allowedFailures = $totalRequests * (100 - $sloTarget) / 100;
        $budgetRemaining = $allowedFailures - $failedRequests;

        // Guard against division by zero when no traffic was recorded
        $budgetPercentRemaining = $allowedFailures > 0
            ? ($budgetRemaining / $allowedFailures) * 100
            : 100;

        return [
            'total_requests' => $totalRequests,
            'failed_requests' => $failedRequests,
            'allowed_failures' => $allowedFailures,
            'budget_remaining' => $budgetRemaining,
            'budget_percent' => $budgetPercentRemaining,
            'slo_met' => $failedRequests <= $allowedFailures,
        ];
    }
}

Budget-Based Decisions

Error budgets drive engineering priorities. When you have budget remaining, invest in features; when budget is low, focus on reliability. A sketch of how these thresholds might be encoded follows the list.

Budget > 50% remaining:
→ Ship features aggressively
→ Run experiments
→ Accept some risk

Budget 20-50% remaining:
→ Normal development pace
→ Monitor closely
→ Fix known reliability issues

Budget < 20% remaining:
→ Slow down feature releases
→ Prioritize reliability work
→ Increase testing rigor

Budget exhausted:
→ Feature freeze
→ All hands on reliability
→ Incident review required for deploys
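
Building on the ErrorBudgetService above, these thresholds can be encoded directly. The action names here are placeholders; what matters is that the mapping is explicit and agreed on in advance.

class ErrorBudgetPolicy
{
    public function action(float $budgetPercentRemaining): string
    {
        return match (true) {
            $budgetPercentRemaining <= 0 => 'feature_freeze',
            $budgetPercentRemaining < 20 => 'reliability_focus',
            $budgetPercentRemaining < 50 => 'normal_with_monitoring',
            default => 'ship_features',
        };
    }
}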

Multi-Window SLOs

Different Time Windows

A single SLO window can mask issues. Use multiple windows to catch both acute incidents and gradual degradation.

class MultiWindowSlo
{
    private array $windows = [
        'hourly' => 60,      // minutes
        'daily' => 1440,
        'weekly' => 10080,
        'monthly' => 43200,
    ];

    public function check(): array
    {
        $results = [];

        foreach ($this->windows as $name => $minutes) {
            $availability = $this->calculateAvailability($minutes);
            $target = $this->getTarget($name);

            $results[$name] = [
                'availability' => $availability,
                'target' => $target,
                'met' => $availability >= $target,
            ];
        }

        return $results;
    }

    private function calculateAvailability(int $minutes): float
    {
        // Same counters as AvailabilityMetrics above, summed over the window
        $total = Metrics::sum('requests.total', "{$minutes}m");
        $successful = Metrics::sum('requests.successful', "{$minutes}m");

        return $total > 0 ? ($successful / $total) * 100 : 100;
    }

    private function getTarget(string $window): float
    {
        // Tighter targets for longer windows
        return match($window) {
            'hourly' => 99.0,   // Allow brief issues
            'daily' => 99.5,
            'weekly' => 99.8,
            'monthly' => 99.9,  // Primary SLO
        };
    }
}

Shorter windows have looser targets because brief incidents are acceptable. Longer windows enforce the overall SLO.

Burn Rate Alerts

Burn rate tells you how fast you are consuming error budget relative to your window duration. A burn rate of 1 means you will exactly exhaust your budget by the end of the window.

class BurnRateAlerting
{
    public function check(): ?Alert
    {
        $monthlyBudget = 0.001; // 0.1% for 99.9% SLO

        // Fast burn: consuming budget 14x faster than sustainable
        $hourlyErrorRate = $this->getErrorRate('1h');
        if ($hourlyErrorRate > $monthlyBudget * 14) {
            return new Alert('critical', 'Fast error budget burn detected');
        }

        // Slow burn: consuming budget 3x faster than sustainable
        $dailyErrorRate = $this->getErrorRate('24h');
        if ($dailyErrorRate > $monthlyBudget * 3) {
            return new Alert('warning', 'Elevated error rate detected');
        }

        return null;
    }

    private function getErrorRate(string $period): float
    {
        // Fraction of requests that failed during the period
        $total = Metrics::sum('requests.total', $period);
        $failed = Metrics::sum('requests.failed', $period);

        return $total > 0 ? $failed / $total : 0.0;
    }
}

A 14x burn rate over an hour would exhaust your monthly budget in about two days, warranting a critical alert. A 3x burn rate is concerning but not immediately critical.
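
The general rule behind these thresholds:

Time to exhaustion = Window length / Burn rate

30 days / 14 ≈ 2.1 days  (critical)
30 days / 3  = 10 days   (warning)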

SLO Documentation

Template

Document your SLOs clearly so everyone understands what the targets are and what happens when they are missed.

# Service: User API

## SLOs

### Availability
- **Target**: 99.9% of requests successful
- **Measurement Window**: 30 days rolling
- **SLI**: (HTTP 2xx + 3xx + 4xx) / Total requests

### Latency
- **Target**: 95% of requests < 200ms
- **Measurement Window**: 30 days rolling
- **SLI**: Request duration histogram

## Error Budget Policy

| Budget Remaining | Action |
|-----------------|--------|
| > 50% | Normal development |
| 20-50% | Increased monitoring |
| < 20% | Reliability focus |
| Exhausted | Feature freeze |

## Stakeholders
- **Owner**: Backend Team
- **Escalation**: oncall@example.com

Monitoring Dashboard

Key Visualizations

Your SLO dashboard should show current status, historical trends, and error budget consumption at a glance.

# Grafana dashboard panels
panels:
  - title: "Current Availability"
    type: gauge
    query: |
      sum(rate(requests_successful[30d])) /
      sum(rate(requests_total[30d])) * 100

  - title: "Error Budget Remaining"
    type: gauge
    query: |
      (0.001 - (1 - availability_30d)) / 0.001 * 100

  - title: "Availability Over Time"
    type: graph
    queries:
      - label: "Actual"
        query: availability_hourly
      - label: "SLO Target"
        query: 99.9

  - title: "Burn Rate"
    type: graph
    query: |
      (1 - availability_1h) / (0.001 / 720) # 720 hours in 30 days

The burn rate visualization is particularly useful during incidents to understand if you are headed toward budget exhaustion.

Common Pitfalls

Too Many SLOs

More SLOs do not mean better reliability. Too many SLOs dilute focus and create alert fatigue.

Bad: 50 SLOs for one service
→ Nothing is prioritized
→ Alert fatigue

Good: 3-5 SLOs covering key user journeys
→ Clear priorities
→ Actionable alerts

Measuring the Wrong Thing

SLOs should measure user experience, not system internals. Low CPU usage means nothing if users are seeing errors.

Bad: "Server CPU < 80%"
→ Doesn't reflect user experience

Good: "99% of checkout completions < 3 seconds"
→ Directly measures user impact

Unrealistic Targets

Setting unachievable targets demoralizes teams and makes error budgets meaningless.

Bad: "99.999% availability" for a startup
→ Impossible to achieve
→ Permanently exhausted budget

Good: Start at current performance + 0.1%
→ Achievable improvement
→ Iterate upward

Conclusion

SLOs transform reliability from a vague goal into measurable targets. Define SLIs that reflect user experience, set achievable targets, and use error budgets to balance reliability with velocity. Start simple with availability and latency SLOs, and add complexity as needed. The goal isn't perfect reliability; it's the right reliability for your users and business.
