SLIs, SLOs, and Error Budgets: A Practical Guide

Philip Rehberger Mar 15, 2026 7 min read

SLIs, SLOs, and error budgets give reliability a language that engineers and business stakeholders can reason about together. Here's how to implement them practically, not just theoretically.


Why You Need SLOs

Without SLOs, reliability discussions are vague. "The site needs to be reliable" means nothing actionable. "We need 99.9% availability" is a number, but it doesn't say what availability means, how it's measured, or what happens when you miss it.

SLOs create a shared language between engineering and the business for talking about reliability. They answer: "How reliable do we need to be, and what does it cost us to get there?"

Service Level Indicators (SLIs)

An SLI is a quantitative measure of some aspect of the service you provide. It's a ratio: the proportion of events that are "good" out of all events.

Common SLIs:

Availability: Proportion of requests that receive a successful response (non-5xx).

Availability SLI = successful_requests / total_requests

Latency: Proportion of requests that complete faster than a threshold.

Latency SLI = requests_under_200ms / total_requests

Quality: Proportion of requests that return correct results (not degraded/cached/partial).

Freshness: Proportion of reads that see data less than N minutes old.

Choose SLIs that directly measure user experience. "CPU utilization" is not an SLI—users don't experience CPU. "Invoice generation success rate" is an SLI—users definitely experience invoice generation failures.
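To make the good-events-over-total-events ratio concrete, here is a minimal sketch that computes availability and latency SLIs from raw request data. The `Request` record and the sample values are hypothetical and not tied to any particular metrics stack:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # end-to-end latency in milliseconds

def availability_sli(requests: list[Request]) -> float:
    """Availability SLI: proportion of requests without a 5xx response."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float = 200.0) -> float:
    """Latency SLI: proportion of requests completing under the threshold."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)

sample = [
    Request(200, 120.0),
    Request(200, 250.0),  # successful but slow
    Request(503, 90.0),   # fast but failed
    Request(200, 50.0),
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

Note that the same request can be "good" for one SLI and "bad" for another, which is why availability and latency are tracked as separate ratios.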

Service Level Objectives (SLOs)

An SLO is a target value for an SLI over a measurement window. SLOs have two components: the target and the window.

SLO: 99.9% of requests succeed over a rolling 30-day window

This means: in any 30-day period, fewer than 0.1% of requests can fail.
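You can translate that percentage into an allowed failure count for your actual traffic volume, which often lands better in planning discussions. A quick sketch (the 10M requests/month figure is an assumed example):

```python
def allowed_failures(slo_target: float, total_requests: int) -> int:
    """Maximum failed requests that still meet the SLO over the window."""
    return int((1 - slo_target) * total_requests)

# Assuming 10M requests over the 30-day window (illustrative figure):
print(allowed_failures(0.999, 10_000_000))  # 10000 failures allowed
```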

Defining your first SLOs:

Start by measuring your current SLIs. Don't set an SLO far below your current performance; that's not a target, it's a floor you'll never notice. But also don't set an SLO higher than you need to; over-engineering reliability is expensive.

For a typical B2B SaaS application:

SLI                          SLO Target   Window
Availability                 99.9%        30 days
P99 latency < 1s             95%          30 days
Invoice generation success   99.5%        30 days
Email delivery success       98%          30 days

Different features deserve different SLOs. Your payment API warrants higher reliability than your report export endpoint.

Error Budgets: The Practical Magic

An error budget is the maximum amount of downtime or failure permitted by your SLO.

For a 99.9% availability SLO over 30 days:

Error budget = (1 - 0.999) * 30 days * 24 hours * 60 minutes
Error budget = 0.001 * 43,200 minutes
Error budget = 43.2 minutes per month

You can fail for 43.2 minutes per month and still meet your SLO. That's your error budget.

Error budgets transform the reliability conversation from "we need zero downtime" (impossible) to "we have 43 minutes of failure budget this month—how do we want to spend it?"
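The budget arithmetic generalizes to any target, and a few lines make the steep cost of each extra nine visible (the helper name is ours, not a standard API):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total failure permitted per window at a given SLO."""
    return (1 - slo_target) * window_days * 24 * 60

for slo in (0.99, 0.999, 0.9999):
    print(slo, error_budget_minutes(slo))
```

Each additional nine divides the budget by ten: 99% allows 432 minutes per month, 99.9% allows 43.2, and 99.99% leaves only about 4.3.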

How error budgets change behavior:

When the error budget is healthy (most of it remaining), you can afford risk: deploy more features, run experiments, upgrade infrastructure. The data tells you there's room.

When the error budget is nearly exhausted, you slow down: fewer deployments, focus on stability improvements, root cause analysis on recent incidents.

This is objective. It removes the political tension between "we need to ship features" and "we need to be stable." Both are true; the error budget quantifies the tradeoff.

Implementing SLI Measurement in Practice

In Prometheus, measure your availability SLI:

# Recording rules for the availability and latency SLIs
groups:
  - name: sli_rules
    interval: 1m
    rules:
      # Availability SLI: ratio of non-5xx to total requests
      - record: sli:http_requests:success_rate1h
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))

      # Latency SLI: ratio of requests faster than 200ms
      - record: sli:http_request_duration:fast_rate1h
        expr: |
          sum(rate(http_request_duration_bucket{le="0.2"}[1h]))
          /
          sum(rate(http_request_duration_count[1h]))

Calculate your 30-day error budget consumption:

# Fraction of 30-day error budget consumed
(
  1 - sum(rate(http_requests_total{status!~"5.."}[30d]))
      / sum(rate(http_requests_total[30d]))
) / (1 - 0.999)

When this value approaches 1.0 (100%), you've consumed your entire error budget.

In Laravel: Application-Level SLI Tracking

For application-specific SLIs (invoice generation, email delivery), emit custom metrics:

class InvoiceGenerationService
{
    public function generate(Invoice $invoice): GeneratedInvoice
    {
        $start = microtime(true);

        try {
            $result = $this->doGenerate($invoice);

            // Increment success counter
            app(MetricsClient::class)->increment('invoice_generation_total', [
                'status' => 'success',
            ]);

            app(MetricsClient::class)->histogram(
                'invoice_generation_duration_seconds',
                microtime(true) - $start
            );

            return $result;

        } catch (\Throwable $e) {
            // Increment failure counter
            app(MetricsClient::class)->increment('invoice_generation_total', [
                'status'     => 'failure',
                'error_type' => get_class($e),
            ]);

            throw $e;
        }
    }
}

Your SLI for invoice generation:

sum(rate(invoice_generation_total{status="success"}[30d]))
/
sum(rate(invoice_generation_total[30d]))

Burn Rate Alerts

Simple threshold alerts on SLI values aren't ideal for error budget management. Burn rate alerts tell you how quickly you're consuming your budget, enabling faster response.

A 1x burn rate means you'll exactly consume your budget at the end of the window. A 14.4x burn rate means you'll consume it in 1/14.4 of the window (about 2 days for a 30-day window).
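As a sanity check on those numbers, a tiny helper (illustrative only) converts a constant burn rate into time until the budget is exhausted:

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """At a constant burn rate, hours until the whole error budget is gone."""
    return window_days * 24 / burn_rate

print(hours_to_exhaustion(1.0))   # 720.0 hours: budget lasts the full window
print(hours_to_exhaustion(14.4))  # 50.0 hours: roughly 2 days
```

This is why 14.4x is a common critical-alert threshold for 30-day windows: it means the whole month's budget disappears in about two days.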

# Alert at different burn rates for different urgencies
- alert: ErrorBudgetBurnRateCritical
  expr: |
    (
      1 - sum(rate(http_requests_total{status!~"5.."}[1h]))
          / sum(rate(http_requests_total[1h]))
    ) / (1 - 0.999) > 14.4
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning at 14.4x rate - exhausted in ~2 days"

- alert: ErrorBudgetBurnRateHigh
  expr: |
    (
      1 - sum(rate(http_requests_total{status!~"5.."}[6h]))
          / sum(rate(http_requests_total[6h]))
    ) / (1 - 0.999) > 6
  for: 15m
  labels:
    severity: high
  annotations:
    summary: "Error budget burning at 6x rate - exhausted in ~5 days"

Service Level Agreements (SLAs) vs SLOs

SLAs are commitments to customers with contractual consequences. SLOs are internal targets. The relationship matters:

SLO should be stricter than SLA. If your SLA promises 99.5% availability, your SLO should target 99.9%. This gives you an internal buffer—your SLO alerts fire before you breach the SLA, giving you time to respond.

SLO (internal target): 99.9% availability
SLA (customer commitment): 99.5% availability
Buffer: 0.4% = ~3 extra hours of incident budget per month

Never set your SLO equal to your SLA. You'll be in constant SLA breach discussions.
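The buffer arithmetic above can be verified in a few lines (the 99.9%/99.5% targets mirror the example; the helper name is ours):

```python
def budget_minutes(target: float, window_days: int = 30) -> float:
    """Failure minutes allowed per window at a given availability target."""
    return (1 - target) * window_days * 24 * 60

# Internal SLO 99.9% vs customer-facing SLA 99.5%, as in the example above
buffer_minutes = budget_minutes(0.995) - budget_minutes(0.999)
print(buffer_minutes / 60)  # ~2.9 hours of extra incident budget per month
```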

Building an SLO Dashboard

Your SLO dashboard should show at a glance:

  • Current SLI value vs. target
  • Error budget remaining (as percentage and absolute time)
  • Error budget burn rate (last 1h, 6h, 24h)
  • Historical SLI trend (last 30 days)

In Grafana, create a dashboard with these panels for each of your SLOs. Display the error budget remaining prominently so on-call engineers and engineering managers can see the reliability health of the system without digging into metrics.

Starting Small

Don't try to define SLOs for every endpoint on day one. Start with:

  1. Pick one critical user journey (login, payment, your core feature)
  2. Define the SLI (usually availability + latency)
  3. Measure your current performance for 2-4 weeks
  4. Set an SLO based on what you observed, minus a small buffer
  5. Build error budget dashboards and burn rate alerts
  6. Run your first error budget review meeting

SLOs are most valuable when they change behavior. If the error budget dashboard doesn't influence deployment decisions or sprint planning, the SLO is just a number.

Practical Takeaways

  • SLIs measure user-facing behavior as a ratio of good events to total events
  • SLOs are targets for SLIs over a time window; start with what you can actually measure
  • Error budgets quantify acceptable risk and resolve the tension between shipping features and maintaining stability
  • Use burn rate alerts instead of threshold alerts to detect fast-moving incidents early
  • Set internal SLOs stricter than external SLAs to maintain a buffer
  • Review error budget consumption in sprint planning to make reliability-feature tradeoffs explicit

Need help building reliable systems? We help teams architect software that scales. scopeforged.com
