Observability pipelines collect, process, and route telemetry data from applications to analysis and storage systems. As applications grow in scale and complexity, the volume of logs, metrics, and traces grows rapidly, often faster than the traffic that generates it. Well-designed pipelines handle this growth while providing the flexibility to route data to different destinations based on content and priority.
A naive approach—shipping everything directly to a single destination—works initially but becomes problematic at scale. Network bandwidth, storage costs, and query performance all suffer. Observability pipelines address these challenges through intelligent processing, filtering, and routing.
Pipeline Architecture
Observability pipelines typically have three stages: collection, processing, and delivery. Each stage can scale independently and provides different capabilities.
Collection gathers telemetry from applications. Agents, sidecars, or library instrumentation push data to collectors. Collection should be lightweight and reliable—losing telemetry defeats its purpose.
Processing transforms, enriches, filters, and routes data. This stage adds context, removes noise, samples high-volume data, and directs data to appropriate destinations.
Delivery sends processed data to storage and analysis systems. Different data types go to appropriate backends: metrics to time-series databases, logs to search engines, traces to trace stores.
# OpenTelemetry Collector pipeline configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  filelog:
    include: [/var/log/app/*.log]
    operators:
      - type: json_parser
      - type: severity_parser
        parse_from: attributes.level

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
  batch:
    timeout: 5s
    send_batch_size: 10000
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert
      - key: cluster
        value: us-east-1
        action: upsert
  filter:
    error_mode: ignore
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_INFO'  # Drop debug logs

exporters:
  otlp/traces:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlp/traces]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, filter, attributes, batch]
      exporters: [loki]
Data Enrichment
Raw telemetry often lacks context needed for effective analysis. Enrichment adds metadata that helps with filtering, grouping, and correlation.
# Add Kubernetes metadata to telemetry
processors:
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.pod.name
        - k8s.node.name
      labels:
        - tag_name: app
          key: app.kubernetes.io/name
        - tag_name: version
          key: app.kubernetes.io/version
  resource:
    attributes:
      - key: service.instance.id
        from_attribute: k8s.pod.name
        action: upsert
Enrichment can also add business context:
// Application-side enrichment
class TelemetryEnricher
{
    public function enrichSpan(Span $span, Request $request): void
    {
        // Add user context
        if ($user = $request->user()) {
            $span->setAttribute('user.id', $user->id);
            $span->setAttribute('user.tier', $user->subscription_tier);
            $span->setAttribute('organization.id', $user->organization_id);
        }

        // Add request context
        $span->setAttribute('request.route', $request->route()?->getName());
        $span->setAttribute('request.client_ip', $request->ip());

        // Add feature flags
        $span->setAttribute('feature.new_checkout', Feature::active('new_checkout'));
    }
}
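Enrichment that depends only on attributes already present on the telemetry can also run in the collector rather than in application code. A sketch using the contrib transform processor's OTTL statements; the `user.tier` convention here is hypothetical, not part of the application code above:

```yaml
processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Label spans that lack a user ID as anonymous (illustrative convention)
          - set(attributes["user.tier"], "anonymous") where attributes["user.id"] == nil
          # Bound attribute sizes to keep payloads and backend cardinality in check
          - truncate_all(attributes, 256)
```

The trade-off: application-side enrichment can see business state (the user's subscription, feature flags), while collector-side enrichment only sees what is already on the span, but applies uniformly across services without redeploying them.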
Filtering and Sampling
High-volume telemetry requires intelligent reduction. Not all data has equal value—debug logs during normal operation and successful health checks provide little insight.
# Filter low-value data
processors:
  filter:
    error_mode: ignore
    logs:
      log_record:
        # Drop debug logs in production
        - 'severity_number < SEVERITY_NUMBER_INFO'
        # Drop health check logs
        - 'attributes["http.route"] == "/health"'
        # Drop successful static file requests (OTTL matches with regex, not globs)
        - 'IsMatch(attributes["http.route"], "/static/.*") and attributes["http.status_code"] < 400'
    traces:
      span:
        # Drop internal health check traces
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/ready"'
Sampling reduces volume while preserving representativeness. Tail-based sampling keeps interesting traces (errors, slow requests) while sampling routine requests:
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      # Always keep error traces
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep slow traces
      - name: latency
        type: latency
        latency:
          threshold_ms: 1000
      # Sample 10% of normal traces
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
Routing and Multi-Destination Delivery
Different telemetry serves different purposes and may need different destinations. Security logs go to SIEM systems. Performance metrics go to monitoring dashboards. Debug logs might go to cheaper cold storage.
# Route based on content
processors:
  routing:
    from_attribute: log.type
    table:
      - value: security
        exporters: [splunk_hec, awss3]
      - value: audit
        exporters: [elasticsearch/audit]
      - value: application
        exporters: [loki]

exporters:
  splunk_hec:
    endpoint: https://splunk.example.com:8088/services/collector
    token: ${SPLUNK_TOKEN}
  elasticsearch/audit:
    endpoints: [https://audit-es.example.com]
    logs_index: audit-logs
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  awss3:
    s3uploader:
      region: us-east-1
      s3_bucket: security-logs
Fan-out to multiple destinations enables different use cases:
service:
  pipelines:
    logs/application:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki, awss3/archive]  # Real-time and archive
    logs/security:
      receivers: [otlp]
      processors: [batch]
      exporters: [splunk_hec, awss3/archive, elasticsearch]  # SIEM, archive, search
Backpressure and Reliability
Observability pipelines must handle traffic spikes without losing data. Backpressure mechanisms prevent overwhelming downstream systems.
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 500

exporters:
  otlp:
    endpoint: backend:4317
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
Buffer data locally when destinations are unavailable. Note that the file exporter only writes a local copy and never replays it; to make the retry queue itself survive collector restarts, back the exporter's sending queue with the file_storage extension:
extensions:
  file_storage:
    directory: /var/otel/storage
    timeout: 10s

exporters:
  file:
    path: /var/otel/buffer/telemetry.json
    rotation:
      max_megabytes: 100
      max_days: 1
      max_backups: 10
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      storage: file_storage  # Persist queued batches to disk

service:
  extensions: [file_storage]
Scaling Pipelines
Horizontal scaling handles increased volume. Deploy multiple collector instances behind a load balancer or use collector pools.
# Kubernetes deployment with multiple collectors
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector  # Must match the selector above
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:latest
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 2
              memory: 4Gi
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318
For high-volume deployments, use a tiered architecture with edge collectors forwarding to central processors:
Application → Edge Collector (per node) → Gateway Collector (regional) → Backend
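In this layout the edge tier stays deliberately thin (receive, protect memory, batch, forward) while heavyweight processors such as tail_sampling run once at the gateway, which sees complete traces. A sketch of the edge side, assuming a gateway reachable at otel-gateway:4317 (a hypothetical address):

```yaml
# Edge collector: minimal processing, forward everything to the gateway
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
  batch: {}
exporters:
  otlp:
    endpoint: otel-gateway:4317  # Hypothetical gateway address
    retry_on_failure:
      enabled: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

Keeping sampling and routing out of the edge tier means per-node collectors stay cheap, and policy changes only require redeploying the gateways.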
Monitoring the Pipeline
Observability pipelines need their own monitoring. Track data flow, processing latency, and error rates.
service:
  telemetry:
    logs:
      level: info
    metrics:
      level: detailed
      address: 0.0.0.0:8888
Key metrics to monitor:
- otelcol_receiver_accepted_spans: Inbound trace data
- otelcol_processor_dropped_spans: Data loss in processing
- otelcol_exporter_sent_spans: Successful delivery
- otelcol_exporter_send_failed_spans: Delivery failures
- otelcol_processor_batch_timeout_trigger_send: Batch timing
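These counters can drive alerts on the pipeline itself. A sketch of Prometheus alerting rules over the collector's self-metrics; the thresholds and durations are illustrative, and depending on collector version the scraped metric names may carry a _total suffix:

```yaml
groups:
  - name: otel-collector
    rules:
      # Sustained drops in processing indicate data loss
      - alert: CollectorDroppingSpans
        expr: rate(otelcol_processor_dropped_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
      # Exporter failures mean telemetry is not reaching its backend
      - alert: CollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 10m
        labels:
          severity: critical
```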
Conclusion
Observability pipelines are essential for managing telemetry at scale. Collect data efficiently with lightweight agents. Process data to add context, filter noise, and sample intelligently. Route data to appropriate destinations based on type and priority.
Design for reliability with backpressure handling and buffering. Scale horizontally to handle growth. Monitor the pipeline itself to ensure telemetry flows reliably. A well-designed pipeline reduces costs, improves query performance, and ensures you have the data needed when investigating issues.