Observability Pipeline Design

Philip Rehberger · Feb 24, 2026

Build scalable observability infrastructure. Collect, process, and store telemetry data efficiently.


Observability pipelines collect, process, and route telemetry data from applications to analysis and storage systems. As applications grow in scale and complexity, the volume of logs, metrics, and traces grows exponentially. Well-designed pipelines handle this growth while providing flexibility to route data to different destinations based on content and priority.

A naive approach—shipping everything directly to a single destination—works initially but becomes problematic at scale. Network bandwidth, storage costs, and query performance all suffer. Observability pipelines address these challenges through intelligent processing, filtering, and routing.

Pipeline Architecture

Observability pipelines typically have three stages: collection, processing, and delivery. Each stage can scale independently and provides different capabilities.

Collection gathers telemetry from applications. Agents, sidecars, or library instrumentation push data to collectors. Collection should be lightweight and reliable—losing telemetry defeats its purpose.

Processing transforms, enriches, filters, and routes data. This stage adds context, removes noise, samples high-volume data, and directs data to appropriate destinations.

Delivery sends processed data to storage and analysis systems. Different data types go to appropriate backends: metrics to time-series databases, logs to search engines, traces to trace stores.

# OpenTelemetry Collector pipeline configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  filelog:
    include: [/var/log/app/*.log]
    operators:
      - type: json_parser
      - type: severity_parser
        parse_from: attributes.level

processors:
  batch:
    timeout: 5s
    send_batch_size: 10000

  memory_limiter:
    check_interval: 1s
    limit_mib: 1000

  attributes:
    actions:
      - key: environment
        value: production
        action: upsert
      - key: cluster
        value: us-east-1
        action: upsert

  filter:
    error_mode: ignore
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_INFO'  # Drop debug logs

exporters:
  otlp/traces:
    endpoint: jaeger:4317
    tls:
      insecure: true

  prometheus:
    endpoint: 0.0.0.0:8889

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first, batch last, per collector recommendations
      processors: [memory_limiter, attributes, batch]
      exporters: [otlp/traces]

    metrics:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [prometheus]

    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, filter, attributes, batch]
      exporters: [loki]

Data Enrichment

Raw telemetry often lacks context needed for effective analysis. Enrichment adds metadata that helps with filtering, grouping, and correlation.

# Add Kubernetes metadata to telemetry
processors:
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.pod.name
        - k8s.node.name
      labels:
        - tag_name: app
          key: app.kubernetes.io/name
        - tag_name: version
          key: app.kubernetes.io/version

  resource:
    attributes:
      - key: service.instance.id
        from_attribute: k8s.pod.name
        action: upsert

Enrichment can also add business context:

// Application-side enrichment
class TelemetryEnricher
{
    public function enrichSpan(Span $span, Request $request): void
    {
        // Add user context
        if ($user = $request->user()) {
            $span->setAttribute('user.id', $user->id);
            $span->setAttribute('user.tier', $user->subscription_tier);
            $span->setAttribute('organization.id', $user->organization_id);
        }

        // Add request context
        $span->setAttribute('request.route', $request->route()?->getName());
        $span->setAttribute('request.client_ip', $request->ip());

        // Add feature flags
        $span->setAttribute('feature.new_checkout', Feature::active('new_checkout'));
    }
}

Filtering and Sampling

High-volume telemetry requires intelligent reduction. Not all data has equal value—debug logs during normal operation and successful health checks provide little insight.

# Filter low-value data
processors:
  filter:
    error_mode: ignore
    logs:
      log_record:
        # Drop debug logs in production
        - 'severity_number < SEVERITY_NUMBER_INFO'
        # Drop health check logs
        - 'attributes["http.route"] == "/health"'
        # Drop successful static file requests (OTTL has no glob match; use IsMatch)
        - 'IsMatch(attributes["http.route"], "^/static/") and attributes["http.status_code"] < 400'

    traces:
      span:
        # Drop internal health check traces
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/ready"'

Sampling reduces volume while preserving representativeness. Tail-based sampling keeps interesting traces (errors, slow requests) while sampling routine requests:

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep error traces
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Always keep slow traces
      - name: latency
        type: latency
        latency:
          threshold_ms: 1000

      # Sample 10% of normal traces
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

Routing and Multi-Destination Delivery

Different telemetry serves different purposes and may need different destinations. Security logs go to SIEM systems. Performance metrics go to monitoring dashboards. Debug logs might go to cheaper cold storage.

# Route based on content
processors:
  routing:
    from_attribute: log.type
    attribute_source: resource
    default_exporters: [loki]  # Fallback when log.type is missing
    table:
      - value: security
        exporters: [splunk_hec, awss3]
      - value: audit
        exporters: [elasticsearch/audit]
      - value: application
        exporters: [loki]

exporters:
  splunk_hec:
    endpoint: https://splunk.example.com
    token: ${SPLUNK_TOKEN}

  elasticsearch/audit:
    endpoints: [https://audit-es.example.com]
    logs_index: audit-logs

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

  awss3:
    s3uploader:
      region: us-east-1
      s3_bucket: security-logs

Fan-out to multiple destinations enables different use cases:

service:
  pipelines:
    logs/application:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki, awss3/archive]  # Real-time and archive

    logs/security:
      receivers: [otlp]
      processors: [batch]
      exporters: [splunk_hec, awss3/archive, elasticsearch]  # SIEM, archive, search

Backpressure and Reliability

Observability pipelines must handle traffic spikes without losing data. Backpressure mechanisms prevent overwhelming downstream systems.

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 500

exporters:
  otlp:
    endpoint: backend:4317
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000

Buffer data locally when destinations are unavailable. The file exporter writes telemetry to local disk, while the file_storage extension provides the on-disk persistence that sending queues can use to survive restarts:

exporters:
  file:
    path: /var/otel/buffer
    rotation:
      max_megabytes: 100
      max_days: 1
      max_backups: 10

extensions:
  file_storage:
    directory: /var/otel/storage
    timeout: 10s
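The file_storage extension does most of its work when an exporter's sending queue references it, which turns the in-memory queue into a persistent one. A minimal sketch of the wiring, using the extension defined above (the `storage` field is supported by exporters built on the collector's exporterhelper, such as `otlp`):

```yaml
extensions:
  file_storage:
    directory: /var/otel/storage

exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      storage: file_storage  # Persist queued batches on disk instead of in memory

service:
  extensions: [file_storage]  # Extensions must also be enabled here
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

With this in place, batches waiting for a retry survive a collector restart instead of being lost with the process.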

Scaling Pipelines

Horizontal scaling handles increased volume. Deploy multiple collector instances behind a load balancer or use collector pools.

# Kubernetes deployment with multiple collectors
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector  # Must match the selector above
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:latest  # Pin a specific version in production
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2
              memory: 4Gi
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318

For high-volume deployments, use a tiered architecture with edge collectors forwarding to central processors:

Application → Edge Collector (per node) → Gateway Collector (regional) → Backend
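A sketch of the edge tier, assuming a regional gateway reachable at otel-gateway.example.com (the hostname is illustrative). Edge collectors do only lightweight work—memory limiting and batching—and forward everything over OTLP; heavier processing such as tail sampling runs on the gateway:

```yaml
# Edge collector (one per node): collect, batch, forward
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 256
  batch: {}

exporters:
  otlp:
    endpoint: otel-gateway.example.com:4317  # Regional gateway (illustrative)

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

If the gateway tier does tail sampling, the contrib loadbalancing exporter can replace the plain otlp exporter on the edge so that all spans of a trace reach the same gateway instance.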

Monitoring the Pipeline

Observability pipelines need their own monitoring. Track data flow, processing latency, and error rates.

service:
  telemetry:
    logs:
      level: info
    metrics:
      level: detailed
      address: 0.0.0.0:8888

Key metrics to monitor:

  • otelcol_receiver_accepted_spans: Inbound trace data
  • otelcol_processor_dropped_spans: Data loss in processing
  • otelcol_exporter_sent_spans: Successful delivery
  • otelcol_exporter_send_failed_spans: Delivery failures
  • otelcol_processor_batch_timeout_trigger_send: Batch timing
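These metrics can drive standard Prometheus alerting. A sketch of two alert rules on export failures and processing drops (thresholds, durations, and severity labels are illustrative assumptions, not recommendations):

```yaml
# prometheus-rules.yaml (illustrative thresholds)
groups:
  - name: otel-collector
    rules:
      - alert: CollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Collector is failing to export spans"

      - alert: CollectorDroppingData
        expr: rate(otelcol_processor_dropped_spans[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Collector is dropping spans during processing"
```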

Conclusion

Observability pipelines are essential for managing telemetry at scale. Collect data efficiently with lightweight agents. Process data to add context, filter noise, and sample intelligently. Route data to appropriate destinations based on type and priority.

Design for reliability with backpressure handling and buffering. Scale horizontally to handle growth. Monitor the pipeline itself to ensure telemetry flows reliably. A well-designed pipeline reduces costs, improves query performance, and ensures you have the data needed when investigating issues.
