Prometheus is an open-source monitoring system designed for reliability and scalability. It collects metrics by scraping HTTP endpoints, stores them in a time-series database, and provides a powerful query language for analysis and alerting. Understanding Prometheus architecture and best practices helps you build effective monitoring for your applications.
The pull-based model is fundamental to Prometheus. Instead of applications pushing metrics, Prometheus periodically scrapes metric endpoints. This simplifies application code, enables service discovery, and allows Prometheus to detect when targets are down.
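As a sketch of that model, a minimal scrape configuration (the job name and targets here are illustrative) tells Prometheus where and how often to pull:

```yaml
# prometheus.yml -- minimal illustrative example
global:
  scrape_interval: 15s   # how often Prometheus pulls each target
scrape_configs:
  - job_name: 'app'
    metrics_path: /metrics
    static_configs:
      - targets: ['app-1:9090']
```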
Metric Types
Prometheus supports four metric types, each suited for different measurements.
Counters track cumulative values that only increase. Request counts, error counts, and bytes transferred are counters. Query rate of change using rate() or increase().
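For example, with the request counter used throughout this article, rate() gives a per-second rate while increase() gives the total growth over the window:

```
# Per-second request rate, averaged over 5 minutes
rate(http_requests_total[5m])
# Total requests over the last hour
increase(http_requests_total[1h])
```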
// Exposing counter metrics in PHP (Laravel)
use Illuminate\Http\Response;
use Illuminate\Support\Facades\Cache;

class MetricsController extends Controller
{
    public function index(): Response
    {
        $metrics = [];

        // Counter: total requests
        $requestCount = Cache::get('metrics:request_count', 0);
        $metrics[] = "# HELP http_requests_total Total HTTP requests";
        $metrics[] = "# TYPE http_requests_total counter";
        $metrics[] = "http_requests_total{method=\"GET\",status=\"200\"} $requestCount";

        // The exposition format requires a trailing newline
        return response(implode("\n", $metrics) . "\n")
            ->header('Content-Type', 'text/plain; version=0.0.4');
    }
}
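A scrape of this endpoint returns the plain-text exposition format; the sample value below is illustrative:

```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
```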
Gauges track values that can go up or down. Queue depth, active connections, and temperature are gauges. Query current values directly.
// Gauge: current queue depth
$queueDepth = Queue::size('default');
$metrics[] = "# HELP queue_depth Current jobs in queue";
$metrics[] = "# TYPE queue_depth gauge";
$metrics[] = "queue_depth{queue=\"default\"} $queueDepth";
Histograms track the distribution of values. Request latency and response sizes benefit from histograms. They provide count, sum, and configurable buckets for calculating percentiles.
// Histogram: request duration
$buckets = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];
$metrics[] = "# HELP http_request_duration_seconds Request duration";
$metrics[] = "# TYPE http_request_duration_seconds histogram";
foreach ($buckets as $bucket) {
    $count = Cache::get("metrics:duration_bucket:$bucket", 0);
    $metrics[] = "http_request_duration_seconds_bucket{le=\"$bucket\"} $count";
}
$metrics[] = "http_request_duration_seconds_bucket{le=\"+Inf\"} " . Cache::get('metrics:duration_count', 0);
$metrics[] = "http_request_duration_seconds_sum " . Cache::get('metrics:duration_sum', 0);
$metrics[] = "http_request_duration_seconds_count " . Cache::get('metrics:duration_count', 0);
Summaries are similar to histograms but calculate quantiles on the client side. They're less flexible but more accurate for specific quantiles. Generally prefer histograms for new implementations.
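To see why cumulative buckets are enough for percentile estimates, here is a sketch in Python (with made-up bucket counts) of the linear interpolation that PromQL's histogram_quantile() performs:

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (le, cumulative_count) sorted by le, ending with
    (math.inf, total) -- mirroring the le="+Inf" bucket. Interpolates
    linearly inside the bucket containing the target rank, like PromQL.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Rank falls in the +Inf bucket: return the last finite bound
                return prev_le
            # Linear interpolation within this bucket
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

# Example: 50 observations <= 0.1s, 40 more <= 0.5s, 10 more <= 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.5, buckets))   # 0.1
print(histogram_quantile(0.95, buckets))  # 0.75
```

The estimate's accuracy depends on bucket boundaries, which is why choosing buckets around your latency objectives matters.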
Instrumenting Applications
Good instrumentation captures what's happening inside your application. The RED method covers key metrics: Rate, Errors, and Duration.
use Closure;
use Illuminate\Http\Request;
use Illuminate\Http\Response;
use Illuminate\Support\Facades\Cache;

class MetricsMiddleware
{
    public function handle(Request $request, Closure $next): Response
    {
        $start = microtime(true);
        $response = $next($request);
        $duration = microtime(true) - $start;

        $method = $request->method();
        $status = $response->status();
        $path = $this->normalizePath($request->path());

        // Increment request counter
        $this->incrementCounter("requests:$method:$path:$status");

        // Record duration in histogram buckets
        $this->recordHistogram("duration:$method:$path", $duration);

        return $response;
    }

    private function normalizePath(string $path): string
    {
        // Replace numeric IDs with placeholders to control label cardinality
        return preg_replace('/\/\d+/', '/:id', $path);
    }

    private function incrementCounter(string $key): void
    {
        Cache::increment("metrics:$key");
    }

    private function recordHistogram(string $key, float $value): void
    {
        Cache::increment("metrics:$key:count");
        // Caution: float increments depend on the cache driver
        // (Redis INCRBY is integer-only); consider storing milliseconds
        Cache::increment("metrics:$key:sum", $value);

        // Buckets are cumulative: increment every bucket the value fits in
        $buckets = [0.01, 0.05, 0.1, 0.5, 1, 5];
        foreach ($buckets as $bucket) {
            if ($value <= $bucket) {
                Cache::increment("metrics:$key:bucket:$bucket");
            }
        }
    }
}
For production PHP applications, use dedicated libraries like promphp/prometheus_client_php that handle metric storage and exposition efficiently.
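As a sketch of what that looks like with promphp/prometheus_client_php (check the README of the version you install for the exact API; the APC storage adapter here is one of several):

```php
use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\APC;

$registry = new CollectorRegistry(new APC());

// Counter with labels
$counter = $registry->getOrRegisterCounter(
    'app', 'http_requests_total', 'Total HTTP requests', ['method', 'status']
);
$counter->inc(['GET', '200']);

// Histogram with explicit buckets
$histogram = $registry->getOrRegisterHistogram(
    'app', 'http_request_duration_seconds', 'Request duration',
    ['method'], [0.01, 0.05, 0.1, 0.5, 1, 5]
);
$histogram->observe(0.23, ['GET']);

// Render the /metrics response body
$renderer = new RenderTextFormat();
echo $renderer->render($registry->getMetricFamilySamples());
```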
PromQL Fundamentals
PromQL (Prometheus Query Language) retrieves and transforms metrics. Understanding PromQL is essential for dashboards and alerts.
Instant vectors return the most recent value for each time series:
# All HTTP request counters
http_requests_total
# Filter by label
http_requests_total{status="500"}
# Regex matching
http_requests_total{path=~"/api/.*"}
Range vectors return values over a time range, used with functions:
# Request rate over 5 minutes
rate(http_requests_total[5m])
# Error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
Aggregation operators combine time series:
# Total requests per status code
sum by (status) (rate(http_requests_total[5m]))
# 99th percentile latency
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Average CPU usage across instances (rate() first, since the metric is a counter)
avg without (instance) (rate(process_cpu_seconds_total[5m]))
Alerting Rules
Prometheus alerting rules define conditions that trigger alerts. Alertmanager handles routing, grouping, and notification.
# prometheus/rules/application.yml
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Slow responses
      - alert: SlowResponses
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile latency above 1s"
          description: "P95 latency is {{ $value | humanizeDuration }}"

      # Service down
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
Alertmanager routes alerts to appropriate channels:
# alertmanager.yml
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
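Both configurations can be validated before deployment; promtool ships with Prometheus and amtool with Alertmanager:

```
promtool check rules prometheus/rules/application.yml
amtool check-config alertmanager.yml
```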
Service Discovery
Prometheus discovers scrape targets dynamically. Static configuration works for small deployments, but dynamic discovery scales better.
# prometheus.yml
scrape_configs:
  # Static targets
  - job_name: 'api'
    static_configs:
      - targets: ['api-1:9090', 'api-2:9090']

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port if specified: combine host with annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  # EC2 service discovery
  - job_name: 'ec2'
    ec2_sd_configs:
      - region: us-east-1
        port: 9090
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Environment]
        target_label: environment
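On the Kubernetes side, a pod opts in through the annotations those relabel rules read (the prometheus.io/* names are a widely used convention, not built-in; they must match your relabel config):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
spec:
  containers:
    - name: api
      image: example/api:1.0   # illustrative image
      ports:
        - containerPort: 9090
```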
Recording Rules
Recording rules precompute expensive queries, improving dashboard performance and enabling alerts on complex expressions.
groups:
  - name: aggregations
    rules:
      # Precompute request rate by service
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Precompute error ratio
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))

      # Precompute latency percentiles
      - record: job:http_latency:p99
        expr: |
          histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
Use recording rules when queries take too long for dashboards, when the same expensive query appears in multiple places, or when you need to alert on aggregated data.
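Recorded series behave like any other metric, so alert expressions simplify; for example, the error-rate alert from earlier can reference the precomputed ratio:

```yaml
- alert: HighErrorRate
  expr: job:http_errors:ratio5m > 0.05
  for: 5m
  labels:
    severity: critical
```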
High Availability
Single Prometheus instances are single points of failure. For production, run multiple Prometheus instances scraping the same targets.
# Run two identical Prometheus instances
# Both scrape the same targets and evaluate the same rules
# Use external labels to distinguish them
global:
  external_labels:
    prometheus_replica: 'prometheus-1'
Thanos or Cortex provide long-term storage and global query across multiple Prometheus instances:
# Thanos sidecar uploads blocks to object storage
- name: thanos-sidecar
  image: quay.io/thanos/thanos:latest   # pin a specific version in production
  args:
    - sidecar
    - --prometheus.url=http://localhost:9090
    - --objstore.config-file=/etc/thanos/bucket.yml
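Thanos Query can then deduplicate the identical series scraped by both replicas using that external label (flag names are from memory of recent Thanos releases; older versions use --store instead of --endpoint):

```
thanos query \
  --query.replica-label=prometheus_replica \
  --endpoint=sidecar-1:10901 \
  --endpoint=sidecar-2:10901
```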
Conclusion
Prometheus provides a robust foundation for application monitoring. The pull-based model and powerful query language enable flexible metrics collection and analysis. Proper instrumentation using the RED method captures application behavior. Alerting rules and Alertmanager provide reliable notification.
Start with the basic metrics: request rate, error rate, and latency. Add application-specific metrics as you identify what's important to monitor. Use recording rules for expensive queries and dashboards. Run multiple instances for high availability. Prometheus scales with your monitoring needs.