Kubernetes Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas based on observed metrics. Instead of manually scaling deployments for varying load, HPA increases replicas when demand rises and decreases them when demand falls. This enables cost-efficient resource utilization while maintaining application performance.
HPA observes metrics, compares them to targets, and calculates the desired replica count. The scaling algorithm aims to bring average metric values to target levels. Understanding how HPA calculates desired replicas helps you configure it effectively.
HPA Basics
HPA watches metrics and adjusts replica counts to maintain target utilization. The basic formula is: desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue)).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
This HPA maintains 2-10 replicas of the api deployment, targeting 70% average CPU utilization. If average CPU reaches 100%, HPA calculates: ceil(currentReplicas × 100/70) = ceil(currentReplicas × 1.43), scaling up by roughly 43%.
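This calculation can be sketched in a few lines of Python. The sketch ignores details the real controller handles (not-ready pods, missing metrics, min/max clamping) but includes one detail worth knowing: the controller skips scaling when the ratio is within a tolerance of the target, 10% by default (the `--horizontal-pod-autoscaler-tolerance` flag).

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Core HPA formula plus the controller's default 10% tolerance."""
    ratio = current_metric / target_metric
    # Within tolerance of the target, HPA leaves the replica count alone.
    if abs(1.0 - ratio) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(4, 90, 70))  # ratio ~1.29, 4 replicas -> 6
print(desired_replicas(4, 72, 70))  # within tolerance -> stays at 4
```

The tolerance is why a deployment sitting at 73% CPU against a 70% target does not scale: small deviations are absorbed rather than chased.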
Resource metrics require resource requests on the target containers. HPA computes utilization as a percentage of each container's requests, not its limits, so without requests it can't calculate utilization at all.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      containers:
      - name: api
        resources:
          requests:
            cpu: 200m       # Required for CPU-based HPA
            memory: 256Mi   # Required for memory-based HPA
          limits:
            cpu: 500m
            memory: 512Mi
Multiple Metrics
HPA can scale based on multiple metrics simultaneously. It calculates desired replicas for each metric and uses the highest value.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  # Scale based on CPU
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Scale based on memory
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Scale based on requests per second
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 100
If CPU suggests 5 replicas, memory suggests 3, and RPS suggests 8, HPA scales to 8 replicas. This ensures no metric exceeds its target.
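The combination rule is just a max over the per-metric recommendations, as this short Python sketch shows (the metric pairs are illustrative values chosen to reproduce the 5/3/8 example above):

```python
import math

def combined_desired(current_replicas, metrics):
    """metrics: list of (current_value, target_value) pairs.
    HPA computes a recommendation per metric and acts on the largest."""
    return max(
        math.ceil(current_replicas * current / target)
        for current, target in metrics
    )

# With 4 replicas: CPU 85%/70% -> 5, memory 55%/80% -> 3, RPS 190/100 -> 8
print(combined_desired(4, [(85, 70), (55, 80), (190, 100)]))  # 8
```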
Custom Metrics
Beyond CPU and memory, HPA can scale on custom metrics from your application or external systems. This requires a metrics adapter like Prometheus Adapter.
# Prometheus Adapter configuration
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
# queue_depth is consumed as an External metric below, so register it under
# externalRules rather than the pod-scoped custom-metrics rules
externalRules:
- seriesQuery: 'queue_depth{namespace!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
  name:
    matches: "^(.*)$"
    as: "${1}"
  metricsQuery: '<<.Series>>{<<.LabelMatchers>>}'
Use custom metrics in HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 1
  maxReplicas: 50
  metrics:
  # Scale based on queue depth
  - type: External
    external:
      metric:
        name: queue_depth
        selector:
          matchLabels:
            queue: default
      target:
        type: AverageValue
        averageValue: 30   # 30 jobs per worker
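For an External metric with an AverageValue target, HPA divides the metric total by the target average (then clamps to the min/max bounds), so the worker count tracks the backlog. A minimal sketch of that arithmetic:

```python
import math

def workers_for_queue(queue_total: float, target_per_worker: float,
                      min_replicas: int, max_replicas: int) -> int:
    """External metric, AverageValue target: total / target, clamped."""
    desired = math.ceil(queue_total / target_per_worker)
    return max(min_replicas, min(max_replicas, desired))

print(workers_for_queue(150, 30, 1, 50))  # 150 queued jobs -> 5 workers
print(workers_for_queue(0, 30, 1, 50))    # empty queue -> clamped to minReplicas
```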
Scaling Behavior
HPA v2 provides fine-grained control over scaling behavior. You can configure how fast scaling happens and add stabilization windows.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up immediately
      policies:
      - type: Percent
        value: 100                      # Double pods
        periodSeconds: 15
      - type: Pods
        value: 4                        # Or add 4 pods
        periodSeconds: 15
      selectPolicy: Max                 # Use whichever adds more
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 10                       # Remove 10% of pods
        periodSeconds: 60
      selectPolicy: Min                 # Conservative scale-down
The stabilization window prevents thrashing. For scale-down, HPA considers the highest recommendation over the window period, avoiding premature scale-down during brief load drops.
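The scale-down stabilization logic can be sketched as a rolling maximum over recent recommendations (a simplification of what the controller actually keeps, but the behavior matches):

```python
from collections import deque

class ScaleDownStabilizer:
    """Sketch: remember recent recommendations and only scale down as far
    as the highest one still inside the stabilization window."""

    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.history = deque()  # (timestamp, recommendation) pairs

    def recommend(self, desired: int, now: float) -> int:
        self.history.append((now, desired))
        # Drop recommendations older than the window.
        while self.history and self.history[0][0] < now - self.window:
            self.history.popleft()
        return max(r for _, r in self.history)

s = ScaleDownStabilizer(window_seconds=300)
print(s.recommend(10, now=0))    # 10
print(s.recommend(4, now=60))    # brief dip -> still 10
print(s.recommend(4, now=400))   # old peak expired -> 4
```

A transient drop in load produces a low recommendation, but the earlier high recommendation keeps winning until it ages out of the window, so replicas are only removed once the drop has persisted.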
Pod Disruption Budget Integration
Coordinate HPA with Pod Disruption Budgets to ensure availability during scaling operations.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2   # Always keep at least 2 pods
  selector:
    matchLabels:
      app: api
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  minReplicas: 3    # HPA min > PDB minAvailable
  # ...
Set HPA minReplicas higher than PDB minAvailable. This ensures enough headroom for rolling updates and voluntary disruptions.
Debugging HPA
When HPA doesn't behave as expected, check its status for detailed information.
# Detailed HPA status
kubectl describe hpa api-hpa
# Current metrics and desired state
kubectl get hpa api-hpa -o yaml
# Check metrics availability
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods" | jq .
# Check custom metrics
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second" | jq .
Common issues include missing resource requests (HPA can't calculate utilization), metrics server not running (no metrics available), custom metrics adapter misconfigured (metrics not exposed), and targets set too low (constant scaling).
// Expose metrics endpoint for custom metrics (Prometheus text format)

use Illuminate\Http\Response;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Queue;

class MetricsController extends Controller
{
    public function index(): Response
    {
        $metrics = [];

        // HTTP requests per second (counter for Prometheus)
        $requestCount = Cache::get('metrics:requests:count', 0);
        $metrics[] = "# HELP http_requests_total Total HTTP requests";
        $metrics[] = "# TYPE http_requests_total counter";
        $metrics[] = "http_requests_total $requestCount";

        // Queue depth (gauge)
        $queueDepth = Queue::size('default');
        $metrics[] = "# HELP queue_depth Current queue depth";
        $metrics[] = "# TYPE queue_depth gauge";
        $metrics[] = "queue_depth{queue=\"default\"} $queueDepth";

        // The exposition format requires a trailing newline
        return response(implode("\n", $metrics) . "\n")
            ->header('Content-Type', 'text/plain; version=0.0.4');
    }
}
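For the adapter to see these series, Prometheus must actually scrape the pods. One common convention is annotation-driven discovery; note that these annotations only take effect if your Prometheus scrape config honors them, and the port and path here are assumptions about your app:

```yaml
# Hypothetical pod-template annotations for annotation-based scrape discovery
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8000"
```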
HPA with Vertical Pod Autoscaler
HPA scales horizontally (more pods). Vertical Pod Autoscaler (VPA) scales vertically (bigger pods). Use them together carefully—they can conflict if both try to adjust the same resources.
# VPA for right-sizing resource requests
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
      controlledResources:
      - cpu
      - memory
When using both, configure VPA to only adjust resources HPA doesn't use. If HPA scales on CPU, let VPA adjust only memory:
resourcePolicy:
  containerPolicies:
  - containerName: api
    controlledResources:
    - memory   # VPA controls memory only;
               # HPA controls CPU-based horizontal scaling
Scaling Patterns
Different workloads need different scaling approaches.
Web APIs: Scale on CPU or request rate. Aggressive scale-up, conservative scale-down.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300
Queue workers: Scale on queue depth. Scale to zero when idle (requires KEDA or custom implementation).
metrics:
- type: External
  external:
    metric:
      name: queue_depth
    target:
      type: AverageValue
      averageValue: 10
Batch processing: Scale on job completion rate or pending work. Tolerate longer scale-up latency.
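One way to express that tolerance is a nonzero scale-up stabilization window combined with a patient scale-down; the values below are illustrative, not recommendations:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 120   # absorb short-lived spikes in pending work
    policies:
    - type: Pods
      value: 10
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 600   # hold capacity until the backlog drains
```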
Conclusion
HPA automates horizontal scaling based on observed metrics. Configure appropriate targets for your workload. Use multiple metrics when a single metric doesn't capture demand accurately. Tune scaling behavior to balance responsiveness against stability.
Monitor HPA behavior and adjust based on observed performance. Scaling too aggressively wastes resources; scaling too conservatively hurts performance. The right configuration depends on your application's characteristics and traffic patterns.