Kubernetes Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas based on observed metrics. Instead of manually scaling deployments for varying load, HPA increases replicas when demand rises and decreases them when demand falls. This enables cost-efficient resource utilization while maintaining application performance.
HPA observes metrics, compares them to targets, and calculates the desired replica count. The scaling algorithm aims to bring average metric values to target levels. Understanding how HPA calculates desired replicas helps you configure it effectively.
HPA Basics
HPA watches metrics and adjusts replica counts to maintain target utilization. The basic formula is: desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue)).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
This HPA maintains 2-10 replicas of the api deployment, targeting 70% average CPU utilization. If average CPU reaches 100%, HPA calculates: ceil(currentReplicas × 100/70) = ceil(currentReplicas × 1.43), scaling up by roughly 43%.
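This calculation can be sketched in a few lines of Python. The sketch ignores details the real controller handles (not-ready pods, missing metrics, min/max clamping) but includes one detail worth knowing: the controller skips scaling when the ratio is within a tolerance of the target, 10% by default (the `--horizontal-pod-autoscaler-tolerance` flag).

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Core HPA formula plus the controller's default 10% tolerance."""
    ratio = current_metric / target_metric
    # Within tolerance of the target, HPA leaves the replica count alone.
    if abs(1.0 - ratio) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(4, 90, 70))  # ratio ~1.29, 4 replicas -> 6
print(desired_replicas(4, 72, 70))  # within tolerance -> stays at 4
```

The tolerance is why a deployment sitting at 73% CPU against a 70% target does not scale: small deviations are absorbed rather than chased.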
Resource metrics require resource requests on the target containers. HPA computes utilization as a percentage of each container's requests, not its limits, so without requests it can't calculate utilization at all.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      containers:
      - name: api
        resources:
          requests:
            cpu: 200m       # Required for CPU-based HPA
            memory: 256Mi   # Required for memory-based HPA
          limits:
            cpu: 500m
            memory: 512Mi
Multiple Metrics
HPA can scale based on multiple metrics simultaneously. It calculates desired replicas for each metric and uses the highest value.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  # Scale based on CPU
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Scale based on memory
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Scale based on requests per second
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 100
If CPU suggests 5 replicas, memory suggests 3, and RPS suggests 8, HPA scales to 8 replicas. This ensures no metric exceeds its target.
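The combination rule is just a max over the per-metric recommendations, as this short Python sketch shows (the metric pairs are illustrative values chosen to reproduce the 5/3/8 example above):

```python
import math

def combined_desired(current_replicas, metrics):
    """metrics: list of (current_value, target_value) pairs.
    HPA computes a recommendation per metric and acts on the largest."""
    return max(
        math.ceil(current_replicas * current / target)
        for current, target in metrics
    )

# With 4 replicas: CPU 85%/70% -> 5, memory 55%/80% -> 3, RPS 190/100 -> 8
print(combined_desired(4, [(85, 70), (55, 80), (190, 100)]))  # 8
```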
Custom Metrics
Beyond CPU and memory, HPA can scale on custom metrics from your application or external systems. This requires a metrics adapter like Prometheus Adapter.
# Prometheus Adapter configuration
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
# queue_depth is consumed as an External metric below, so register it under
# externalRules rather than the pod-scoped custom-metrics rules
externalRules:
- seriesQuery: 'queue_depth{namespace!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
  name:
    matches: "^(.*)$"
    as: "${1}"
  metricsQuery: '<<.Series>>{<<.LabelMatchers>>}'
Use custom metrics in HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 1
  maxReplicas: 50
  metrics:
  # Scale based on queue depth
  - type: External
    external:
      metric:
        name: queue_depth
        selector:
          matchLabels:
            queue: default
      target:
        type: AverageValue
        averageValue: 30   # 30 jobs per worker
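For an External metric with an AverageValue target, HPA divides the metric total by the target average (then clamps to the min/max bounds), so the worker count tracks the backlog. A minimal sketch of that arithmetic:

```python
import math

def workers_for_queue(queue_total: float, target_per_worker: float,
                      min_replicas: int, max_replicas: int) -> int:
    """External metric, AverageValue target: total / target, clamped."""
    desired = math.ceil(queue_total / target_per_worker)
    return max(min_replicas, min(max_replicas, desired))

print(workers_for_queue(150, 30, 1, 50))  # 150 queued jobs -> 5 workers
print(workers_for_queue(0, 30, 1, 50))    # empty queue -> clamped to minReplicas
```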
Scaling Behavior
HPA v2 provides fine-grained control over scaling behavior. You can configure how fast scaling happens and add stabilization windows.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up immediately
      policies:
      - type: Percent
        value: 100                      # Double pods
        periodSeconds: 15
      - type: Pods
        value: 4                        # Or add 4 pods
        periodSeconds: 15
      selectPolicy: Max                 # Use whichever adds more
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 10                       # Remove 10% of pods
        periodSeconds: 60
      selectPolicy: Min                 # Conservative scale-down
The stabilization window prevents thrashing. For scale-down, HPA considers the highest recommendation over the window period, avoiding premature scale-down during brief load drops.
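The scale-down stabilization logic can be sketched as a rolling maximum over recent recommendations (a simplification of what the controller actually keeps, but the behavior matches):

```python
from collections import deque

class ScaleDownStabilizer:
    """Sketch: remember recent recommendations and only scale down as far
    as the highest one still inside the stabilization window."""

    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.history = deque()  # (timestamp, recommendation) pairs

    def recommend(self, desired: int, now: float) -> int:
        self.history.append((now, desired))
        # Drop recommendations older than the window.
        while self.history and self.history[0][0] < now - self.window:
            self.history.popleft()
        return max(r for _, r in self.history)

s = ScaleDownStabilizer(window_seconds=300)
print(s.recommend(10, now=0))    # 10
print(s.recommend(4, now=60))    # brief dip -> still 10
print(s.recommend(4, now=400))   # old peak expired -> 4
```

A transient drop in load produces a low recommendation, but the earlier high recommendation keeps winning until it ages out of the window, so replicas are only removed once the drop has persisted.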
Pod Disruption Budget Integration
Coordinate HPA with Pod Disruption Budgets to ensure availability during scaling operations.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2   # Always keep at least 2 pods
  selector:
    matchLabels:
      app: api
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  minReplicas: 3    # HPA min > PDB minAvailable
  # ...
Set HPA minReplicas higher than PDB minAvailable. This ensures enough headroom for rolling updates and voluntary disruptions.
Debugging HPA
When HPA doesn't behave as expected, check its status for detailed information.
# Detailed HPA status
kubectl describe hpa api-hpa
# Current metrics and desired state
kubectl get hpa api-hpa -o yaml
# Check metrics availability
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods" | jq .
# Check custom metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second" | jq .
Common issues include missing resource requests (HPA can't calculate utilization), metrics server not running (no metrics available), custom metrics adapter misconfigured (metrics not exposed), and targets set too low (constant scaling).
// Expose metrics endpoint for custom metrics (Prometheus text format)

use Illuminate\Http\Response;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Queue;

class MetricsController extends Controller
{
    public function index(): Response
    {
        $metrics = [];

        // HTTP requests per second (counter for Prometheus)
        $requestCount = Cache::get('metrics:requests:count', 0);
        $metrics[] = "# HELP http_requests_total Total HTTP requests";
        $metrics[] = "# TYPE http_requests_total counter";
        $metrics[] = "http_requests_total $requestCount";

        // Queue depth (gauge)
        $queueDepth = Queue::size('default');
        $metrics[] = "# HELP queue_depth Current queue depth";
        $metrics[] = "# TYPE queue_depth gauge";
        $metrics[] = "queue_depth{queue=\"default\"} $queueDepth";

        // The exposition format requires a trailing newline
        return response(implode("\n", $metrics) . "\n")
            ->header('Content-Type', 'text/plain; version=0.0.4');
    }
}
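For the adapter to see these series, Prometheus must actually scrape the pods. One common convention is annotation-driven discovery; note that these annotations only take effect if your Prometheus scrape config honors them, and the port and path here are assumptions about your app:

```yaml
# Hypothetical pod-template annotations for annotation-based scrape discovery
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8000"
```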
HPA with Vertical Pod Autoscaler
HPA scales horizontally (more pods). Vertical Pod Autoscaler (VPA) scales vertically (bigger pods). Use them together carefully—they can conflict if both try to adjust the same resources.
# VPA for right-sizing resource requests
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
      controlledResources:
      - cpu
      - memory
When using both, configure VPA to only adjust resources HPA doesn't use. If HPA scales on CPU, let VPA adjust only memory:
resourcePolicy:
  containerPolicies:
  - containerName: api
    controlledResources:
    - memory   # VPA controls memory only;
               # HPA controls CPU-based horizontal scaling
Scaling Patterns
Different workloads need different scaling approaches.
Web APIs: Scale on CPU or request rate. Aggressive scale-up, conservative scale-down.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300
Queue workers: Scale on queue depth. Scale to zero when idle (requires KEDA or custom implementation).
metrics:
- type: External
  external:
    metric:
      name: queue_depth
    target:
      type: AverageValue
      averageValue: 10
Batch processing: Scale on job completion rate or pending work. Tolerate longer scale-up latency.
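One way to express that tolerance is a nonzero scale-up stabilization window combined with a patient scale-down; the values below are illustrative, not recommendations:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 120   # absorb short-lived spikes in pending work
    policies:
    - type: Pods
      value: 10
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 600   # hold capacity until the backlog drains
```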
Conclusion
HPA automates horizontal scaling based on observed metrics. Configure appropriate targets for your workload. Use multiple metrics when a single metric doesn't capture demand accurately. Tune scaling behavior to balance responsiveness against stability.
Monitor HPA behavior and adjust based on observed performance. Scaling too aggressively wastes resources; scaling too conservatively hurts performance. The right configuration depends on your application's characteristics and traffic patterns.