The average Kubernetes cluster runs at 15-25% CPU utilization. That means teams are paying for roughly four to seven times more compute than they're using. Cost optimization isn't about being cheap — it's about reclaiming budget that's being wasted on idle capacity and reinvesting it where it matters.
This is a practical guide to finding and eliminating waste without making your systems fragile.
Understanding Where Your Money Goes
Before optimizing, you need visibility. Most teams are surprised to discover which workloads actually cost the most.
Cluster Visibility with Kubecost
Kubecost (open source) provides per-namespace, per-deployment, and per-label cost breakdowns:
# Install Kubecost
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm upgrade --install kubecost kubecost/cost-analyzer \
--namespace kubecost \
--create-namespace \
--set kubecostToken="your-token-here"
# Forward to local
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090
With Kubecost running, you'll see allocation by namespace and deployment — which teams are spending what, and where the biggest opportunities are.
Spotting Over-Provisioned Pods
Look for the gap between requested resources and actual usage:
# View resource requests vs actual usage
kubectl top pods --all-namespaces
# Compare to what's requested
kubectl get pods --all-namespaces -o json | jq '
.items[] | {
name: .metadata.name,
namespace: .metadata.namespace,
cpu_request: .spec.containers[].resources.requests.cpu,
memory_request: .spec.containers[].resources.requests.memory
}'
If you see a pod requesting 2 CPU cores and using 100m consistently, that's 95% waste in CPU reservation.
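To make that gap visible at a glance, you can join the two views with a small shell loop. A rough sketch, assuming metrics-server is installed; it only inspects the first container of each pod, and it makes one API call per pod, so it's best for small clusters:

# Rough waste report: per-pod CPU usage vs first-container CPU request
kubectl top pods -A --no-headers | while read -r ns pod cpu mem; do
  req=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.containers[0].resources.requests.cpu}')
  echo "$ns/$pod  usage=$cpu  request=${req:-none}"
done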
Right-Sizing Resource Requests and Limits
This is the highest-leverage optimization. Wrong resource settings waste money (over-request) or cause outages (under-request).
The Difference Between Requests and Limits
resources:
requests:
memory: "256Mi" # Reserved for scheduling. Scheduler uses this.
cpu: "250m" # Reserved. Node won't accept pod if this isn't available.
limits:
memory: "512Mi" # Hard ceiling. Exceeding this kills the pod (OOMKilled).
cpu: "1000m" # Soft ceiling. CPU is throttled, not killed.
Requests determine scheduling and node capacity allocation. Even if your pod uses 50m CPU, if it requests 500m, that 450m is unavailable to other workloads.
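You can see this booking effect directly on a node, since the scheduler allocates against requests, not usage. The node name below is a placeholder:

# Compare requests already booked on a node against its allocatable capacity
kubectl describe node <node-name> | grep -A 8 "Allocated resources"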
Vertical Pod Autoscaler for Recommendations
VPA observes actual usage and recommends (or automatically sets) better resource requests:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api
updatePolicy:
updateMode: "Off" # Recommendation mode — don't auto-update yet
resourcePolicy:
containerPolicies:
- containerName: api
minAllowed:
cpu: "50m"
memory: "64Mi"
maxAllowed:
cpu: "2"
memory: "1Gi"
After a week of data collection, check recommendations:
kubectl describe vpa api-vpa
# Output includes:
# Recommendation:
# Container Recommendations:
# Container Name: api
# Lower Bound: cpu: 80m memory: 180Mi
# Target: cpu: 120m memory: 240Mi
# Upper Bound: cpu: 500m memory: 400Mi
The "Target" is your new baseline. Adjust requests to match, measure for stability, then repeat.
Common Right-Sizing Rules of Thumb
Set requests to: P95 of actual usage over 7 days
Set limits to: 2x-3x requests for most workloads
For memory: closer to 1.5x (OOM kills are harsh)
For CPU: 4x-5x (throttling is gentler than OOM)
Exception: Don't set CPU limits for latency-sensitive services.
CPU throttling adds unpredictable latency spikes.
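In practice the exception looks like this: keep the CPU request for scheduling and the memory limit for safety, and simply omit the CPU limit. A minimal sketch:

resources:
  requests:
    cpu: "250m"        # still needed for scheduling
    memory: "256Mi"
  limits:
    memory: "384Mi"    # memory limit kept; no cpu limit means no throttling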
Horizontal Pod Autoscaling
Don't run at peak capacity 24/7. Scale with demand.
CPU-Based HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Pods
value: 2
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately
policies:
- type: Pods
value: 4
periodSeconds: 15
The asymmetric scaling behavior is intentional: scale up aggressively (don't make users wait), scale down conservatively (avoid flapping).
Custom Metrics HPA
For queue-based workloads, scale on queue depth rather than CPU:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: worker-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: worker
minReplicas: 1
maxReplicas: 50
metrics:
- type: External
external:
metric:
name: sqs_approximate_number_of_messages_visible
selector:
matchLabels:
queue_name: "jobs-queue"
target:
type: AverageValue
averageValue: "30" # 1 worker per 30 messages
This requires a metrics adapter (KEDA is excellent for this), but it means your worker fleet scales directly with actual workload pressure.
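With KEDA specifically, you declare a ScaledObject instead of writing the HPA yourself. A sketch using KEDA's aws-sqs-queue scaler; the queue URL is a placeholder, and the AWS authentication setup is omitted:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker            # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs-queue
        queueLength: "30"   # target messages per replica
        awsRegion: us-east-1

KEDA creates and manages the underlying HPA for you.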
Cluster Autoscaler and Node Right-Sizing
HPA handles pod count. Cluster Autoscaler handles node count.
# Cluster Autoscaler deployment (simplified)
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
spec:
template:
spec:
containers:
- name: cluster-autoscaler
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --nodes=2:20:eks-node-group-name
- --scale-down-utilization-threshold=0.5
- --scale-down-delay-after-add=5m
- --skip-nodes-with-local-storage=false
With scale-down-utilization-threshold=0.5, nodes at less than 50% utilization are candidates for removal. Workloads get rescheduled and the node is terminated.
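If specific pods must never be moved by a scale-down (a long-running batch job, for example), Cluster Autoscaler honors a pod-level annotation. Use it sparingly, since every annotated pod pins its node and blocks the savings:

# On the pod template of workloads that must not be evicted during scale-down
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"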
Node Instance Type Strategy
Don't use one instance type for everything:
General workloads: m6i.xlarge, m6i.2xlarge (balanced)
CPU-intensive: c6i.2xlarge, c6i.4xlarge (compute optimized)
Memory-intensive: r6i.2xlarge (memory optimized)
ML/GPU: g4dn.xlarge (GPU)
Arm/cost-efficient: m7g.xlarge (Graviton3; 20-40% cheaper than comparable x86)
ARM-based instances (AWS Graviton, GCP Tau T2A) often deliver 20-40% better price-performance for most workloads. If your containers support multi-arch (they should), this is free savings.
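Shipping multi-arch images is usually a one-line change to the build. A sketch with Docker Buildx; the registry and tag are placeholders:

# Build and push an image that runs on both x86 and ARM nodes
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t registry.example.com/api:v1.2.3 \
  --push .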
Spot Instances: The Big Win
Spot instances (AWS Spot, GCP Preemptible, Azure Spot) cost 60-90% less than on-demand. They can be reclaimed with a two-minute warning on AWS, and as little as 30 seconds on other clouds. Used correctly, they're safe.
Categorizing Workloads for Spot
Ideal for spot:
- Batch jobs and data processing
- CI/CD build runners
- Stateless workers and queue consumers
- Non-production environments
- Fault-tolerant web serving (with multiple replicas)
Not suitable for spot:
- Single-replica critical services
- Stateful workloads without fast failover
- Jobs that can't checkpoint and resume
- Services with very long startup times
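Once workloads are categorized, enforce the split with scheduling constraints so only spot-tolerant pods land on spot nodes. A sketch, assuming your spot nodes carry a spot=true:NoSchedule taint (the taint key is a convention, not a built-in; the capacityType label is applied automatically by EKS managed node groups):

# In the pod template of spot-eligible workloads
spec:
  tolerations:
    - key: "spot"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT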
Spot Node Groups in EKS
# Terraform: mixed on-demand and spot node group
resource "aws_eks_node_group" "workers" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "workers"
node_role_arn = aws_iam_role.node.arn
subnet_ids = var.private_subnet_ids
capacity_type = "SPOT"
instance_types = [
"m6i.xlarge",
"m6a.xlarge",
"m5.xlarge",
"m5a.xlarge",
]
# Multiple instance types reduce interruption risk
scaling_config {
desired_size = 5
min_size = 2
max_size = 50
}
}
Using multiple similar instance types (all 4-vCPU, 16GB) is critical. If your spot pool only has one instance type, an interruption event could affect all nodes simultaneously.
Handling Spot Interruptions Gracefully
# Pod Disruption Budget — limits simultaneous disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-pdb
spec:
minAvailable: 2 # Always keep 2 running during disruptions
selector:
matchLabels:
app: api
# Deployment anti-affinity (in the pod template, spec.template.spec): spread pods across nodes
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
topologyKey: kubernetes.io/hostname
labelSelector:
matchLabels:
app: api
With anti-affinity and PDB, a single spot interruption will take down at most one pod, and the disruption budget ensures the service remains available while rescheduling.
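On AWS, the two-minute interruption notice only helps if something acts on it. EKS managed node groups handle draining for you; for self-managed node groups, the AWS Node Termination Handler watches for the notice and cordons and drains the node:

# Install the AWS Node Termination Handler (self-managed nodes)
helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system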
Namespace-Level Resource Quotas
Prevent runaway costs from misconfigured workloads:
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-quota
namespace: payments-team
spec:
hard:
requests.cpu: "20"
requests.memory: "40Gi"
limits.cpu: "40"
limits.memory: "80Gi"
pods: "100"
services: "20"
---
# LimitRange: default limits for pods that don't specify
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: payments-team
spec:
limits:
- type: Container
default:
cpu: "500m"
memory: "256Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
Without a LimitRange, pods that don't specify resources run with no requests or limits at all. One runaway pod can consume an entire node.
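Both objects are easy to audit after the fact: describe shows current consumption against the quota and the defaults being injected.

# Check quota consumption and injected defaults
kubectl describe resourcequota team-quota -n payments-team
kubectl describe limitrange default-limits -n payments-team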
Scheduled Scaling for Predictable Traffic
If you know traffic drops at night, scale down proactively:
# Scale down non-prod at night
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-staging
spec:
  schedule: "0 20 * * 1-5"  # 8 PM weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler   # assumes a ServiceAccount with RBAC to scale deployments
          restartPolicy: OnFailure     # required: Job pods can't use the default Always
          containers:
            - name: scaler
              image: bitnami/kubectl
              command:
                - kubectl
                - scale
                - deployment/api
                - --replicas=1
                - -n
                - staging
For development and staging environments, you can go further and implement complete cluster shutdown during off-hours using node group scaling.
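A sketch of that shutdown on EKS; the cluster and node group names are placeholders, and a mirror-image job scales things back up in the morning:

# Scale a staging node group to zero overnight
aws eks update-nodegroup-config \
  --cluster-name staging \
  --nodegroup-name workers \
  --scaling-config minSize=0,maxSize=5,desiredSize=0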
Cost Allocation and Chargeback
Make costs visible to the teams generating them:
# Label everything consistently
metadata:
labels:
team: payments
product: checkout
environment: production
cost-center: "eng-123"
Kubecost and cloud provider cost allocation tools use these labels to generate per-team reports. When teams see their infrastructure costs, behavior changes. The payments team that's running 20 replicas of a service at 3% utilization will right-size it when the cost is attributed to their budget.
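With consistent labels in place, per-team numbers can be pulled straight from Kubecost's allocation API (assuming the port-forward from earlier; exact parameters may vary across Kubecost versions):

# 7-day cost allocation aggregated by the team label
curl -s "http://localhost:9090/model/allocation?window=7d&aggregate=label:team"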
Summary: Priority Order for Cost Reduction
Attack in this order for maximum ROI:
1. Right-size resource requests (free, immediate, 20-40% savings)
2. Implement HPA to match capacity to demand (reduces avg replicas)
3. Use spot instances for eligible workloads (60-80% node cost reduction)
4. Implement Cluster Autoscaler (removes idle nodes automatically)
5. Use ARM instances where possible (20-40% cheaper)
6. Implement namespace quotas (prevents waste from misconfigurations)
7. Schedule scaling for predictable off-peak periods
8. Enable cost allocation labels for accountability
Most teams that go through this process systematically find 40-60% savings without any reduction in reliability — often with improved reliability, because right-sized workloads behave more predictably.
Building something that needs to scale? We help teams architect systems that grow with their business. scopeforged.com