The average Kubernetes cluster runs at 15-25% CPU utilization. That means teams are paying for roughly four to seven times more compute than they're using. Cost optimization isn't about being cheap — it's about reclaiming budget that's being wasted on idle capacity and reinvesting it where it matters.
This is a practical guide to finding and eliminating waste without making your systems fragile.
Understanding Where Your Money Goes
Before optimizing, you need visibility. Most teams are surprised to discover which workloads actually cost the most.
Cluster Visibility with Kubecost
Kubecost (open source) provides per-namespace, per-deployment, and per-label cost breakdowns:
# Install Kubecost
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm upgrade --install kubecost kubecost/cost-analyzer \
--namespace kubecost \
--create-namespace \
--set kubecostToken="your-token-here"
# Forward to local
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090
With Kubecost running, you'll see allocation by namespace and deployment — which teams are spending what, and where the biggest opportunities are.
Spotting Over-Provisioned Pods
Look for the gap between requested resources and actual usage:
# View resource requests vs actual usage
kubectl top pods --all-namespaces
# Compare to what's requested
kubectl get pods --all-namespaces -o json | jq '
.items[] | {
name: .metadata.name,
namespace: .metadata.namespace,
cpu_request: .spec.containers[].resources.requests.cpu,
memory_request: .spec.containers[].resources.requests.memory
}'
If you see a pod requesting 2 CPU cores and using 100m consistently, that's 95% waste in CPU reservation.
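To make that gap visible at a glance, you can join the two views with a small shell loop. A rough sketch, assuming metrics-server is installed; it only inspects the first container of each pod, and it makes one API call per pod, so it's best for small clusters:

# Rough waste report: per-pod CPU usage vs first-container CPU request
kubectl top pods -A --no-headers | while read -r ns pod cpu mem; do
  req=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.containers[0].resources.requests.cpu}')
  echo "$ns/$pod  usage=$cpu  request=${req:-none}"
done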
Right-Sizing Resource Requests and Limits
This is the highest-leverage optimization. Wrong resource settings waste money (over-request) or cause outages (under-request).
The Difference Between Requests and Limits
resources:
requests:
memory: "256Mi" # Reserved for scheduling. Scheduler uses this.
cpu: "250m" # Reserved. Node won't accept pod if this isn't available.
limits:
memory: "512Mi" # Hard ceiling. Exceeding this kills the pod (OOMKilled).
cpu: "1000m" # Soft ceiling. CPU is throttled, not killed.
Requests determine scheduling and node capacity allocation. Even if your pod uses 50m CPU, if it requests 500m, that 450m is unavailable to other workloads.
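You can see this booking effect directly on a node, since the scheduler allocates against requests, not usage. The node name below is a placeholder:

# Compare requests already booked on a node against its allocatable capacity
kubectl describe node <node-name> | grep -A 8 "Allocated resources"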
Vertical Pod Autoscaler for Recommendations
VPA observes actual usage and recommends (or automatically sets) better resource requests:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api
updatePolicy:
updateMode: "Off" # Recommendation mode — don't auto-update yet
resourcePolicy:
containerPolicies:
- containerName: api
minAllowed:
cpu: "50m"
memory: "64Mi"
maxAllowed:
cpu: "2"
memory: "1Gi"
After a week of data collection, check recommendations:
kubectl describe vpa api-vpa
# Output includes:
# Recommendation:
# Container Recommendations:
# Container Name: api
# Lower Bound: cpu: 80m memory: 180Mi
# Target: cpu: 120m memory: 240Mi
# Upper Bound: cpu: 500m memory: 400Mi
The "Target" is your new baseline. Adjust requests to match, measure for stability, then repeat.
Common Right-Sizing Rules of Thumb
Set requests to: P95 of actual usage over 7 days
Set limits to: 2x-3x requests for most workloads
For memory: closer to 1.5x (OOM kills are harsh)
For CPU: 4x-5x (throttling is gentler than OOM)
Exception: Don't set CPU limits for latency-sensitive services.
CPU throttling adds unpredictable latency spikes.
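In practice the exception looks like this: keep the CPU request for scheduling and the memory limit for safety, and simply omit the CPU limit. A minimal sketch:

resources:
  requests:
    cpu: "250m"        # still needed for scheduling
    memory: "256Mi"
  limits:
    memory: "384Mi"    # memory limit kept; no cpu limit means no throttling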
Horizontal Pod Autoscaling
Don't run at peak capacity 24/7. Scale with demand.
CPU-Based HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
behavior:
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
policies:
- type: Pods
value: 2
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0 # Scale up immediately
policies:
- type: Pods
value: 4
periodSeconds: 15
The asymmetric scaling behavior is intentional: scale up aggressively (don't make users wait), scale down conservatively (avoid flapping).
Custom Metrics HPA
For queue-based workloads, scale on queue depth rather than CPU:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: worker-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: worker
minReplicas: 1
maxReplicas: 50
metrics:
- type: External
external:
metric:
name: sqs_approximate_number_of_messages_visible
selector:
matchLabels:
queue_name: "jobs-queue"
target:
type: AverageValue
averageValue: "30" # 1 worker per 30 messages
This requires a metrics adapter (KEDA is excellent for this), but it means your worker fleet scales directly with actual workload pressure.
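With KEDA specifically, you declare a ScaledObject instead of writing the HPA yourself. A sketch using KEDA's aws-sqs-queue scaler; the queue URL is a placeholder, and the AWS authentication setup is omitted:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker            # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs-queue
        queueLength: "30"   # target messages per replica
        awsRegion: us-east-1

KEDA creates and manages the underlying HPA for you.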
Cluster Autoscaler and Node Right-Sizing
HPA handles pod count. Cluster Autoscaler handles node count.
# Cluster Autoscaler deployment (simplified)
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
spec:
template:
spec:
containers:
- name: cluster-autoscaler
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --nodes=2:20:eks-node-group-name
- --scale-down-utilization-threshold=0.5
- --scale-down-delay-after-add=5m
- --skip-nodes-with-local-storage=false
With scale-down-utilization-threshold=0.5, nodes at less than 50% utilization are candidates for removal. Workloads get rescheduled and the node is terminated.
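If specific pods must never be moved by a scale-down (a long-running batch job, for example), Cluster Autoscaler honors a pod-level annotation. Use it sparingly, since every annotated pod pins its node and blocks the savings:

# On the pod template of workloads that must not be evicted during scale-down
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"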
Node Instance Type Strategy
Don't use one instance type for everything:
General workloads: m6i.xlarge, m6i.2xlarge (balanced)
CPU-intensive: c6i.2xlarge, c6i.4xlarge (compute optimized)
Memory-intensive: r6i.2xlarge (memory optimized)
ML/GPU: g4dn.xlarge (GPU)
Arm/cost-efficient: m7g.xlarge (Graviton3; 20-40% cheaper than comparable x86)
ARM-based instances (AWS Graviton, GCP Tau T2A) often deliver 20-40% better price-performance for most workloads. If your containers support multi-arch (they should), this is free savings.
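Shipping multi-arch images is usually a one-line change to the build. A sketch with Docker Buildx; the registry and tag are placeholders:

# Build and push an image that runs on both x86 and ARM nodes
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t registry.example.com/api:v1.2.3 \
  --push .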
Spot Instances: The Big Win
Spot instances (AWS Spot, GCP Preemptible, Azure Spot) cost 60-90% less than on-demand. They can be reclaimed with a two-minute warning on AWS, and as little as 30 seconds on other clouds. Used correctly, they're safe.
Categorizing Workloads for Spot
Ideal for spot:
- Batch jobs and data processing
- CI/CD build runners
- Stateless workers and queue consumers
- Non-production environments
- Fault-tolerant web serving (with multiple replicas)
Not suitable for spot:
- Single-replica critical services
- Stateful workloads without fast failover
- Jobs that can't checkpoint and resume
- Services with very long startup times
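Once workloads are categorized, enforce the split with scheduling constraints so only spot-tolerant pods land on spot nodes. A sketch, assuming your spot nodes carry a spot=true:NoSchedule taint (the taint key is a convention, not a built-in; the capacityType label is applied automatically by EKS managed node groups):

# In the pod template of spot-eligible workloads
spec:
  tolerations:
    - key: "spot"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT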
Spot Node Groups in EKS
# Terraform: mixed on-demand and spot node group
resource "aws_eks_node_group" "workers" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "workers"
node_role_arn = aws_iam_role.node.arn
subnet_ids = var.private_subnet_ids
capacity_type = "SPOT"
instance_types = [
"m6i.xlarge",
"m6a.xlarge",
"m5.xlarge",
"m5a.xlarge",
]
# Multiple instance types reduce interruption risk
scaling_config {
desired_size = 5
min_size = 2
max_size = 50
}
}
Using multiple similar instance types (all 4-vCPU, 16GB) is critical. If your spot pool only has one instance type, an interruption event could affect all nodes simultaneously.
Handling Spot Interruptions Gracefully
# Pod Disruption Budget — limits simultaneous disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-pdb
spec:
minAvailable: 2 # Always keep 2 running during disruptions
selector:
matchLabels:
app: api
# Deployment anti-affinity (in the pod template, spec.template.spec): spread pods across nodes
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
topologyKey: kubernetes.io/hostname
labelSelector:
matchLabels:
app: api
With anti-affinity and PDB, a single spot interruption will take down at most one pod, and the disruption budget ensures the service remains available while rescheduling.
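On AWS, the two-minute interruption notice only helps if something acts on it. EKS managed node groups handle draining for you; for self-managed node groups, the AWS Node Termination Handler watches for the notice and cordons and drains the node:

# Install the AWS Node Termination Handler (self-managed nodes)
helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system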
Namespace-Level Resource Quotas
Prevent runaway costs from misconfigured workloads:
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-quota
namespace: payments-team
spec:
hard:
requests.cpu: "20"
requests.memory: "40Gi"
limits.cpu: "40"
limits.memory: "80Gi"
pods: "100"
services: "20"
---
# LimitRange: default limits for pods that don't specify
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: payments-team
spec:
limits:
- type: Container
default:
cpu: "500m"
memory: "256Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
Without a LimitRange, pods that don't specify resources run with no requests or limits at all. One runaway pod can consume an entire node.
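Both objects are easy to audit after the fact: describe shows current consumption against the quota and the defaults being injected.

# Check quota consumption and injected defaults
kubectl describe resourcequota team-quota -n payments-team
kubectl describe limitrange default-limits -n payments-team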
Scheduled Scaling for Predictable Traffic
If you know traffic drops at night, scale down proactively:
# Scale down non-prod at night
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-staging
spec:
  schedule: "0 20 * * 1-5"  # 8 PM weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler   # assumes a ServiceAccount with RBAC to scale deployments
          restartPolicy: OnFailure     # required: Job pods can't use the default Always
          containers:
            - name: scaler
              image: bitnami/kubectl
              command:
                - kubectl
                - scale
                - deployment/api
                - --replicas=1
                - -n
                - staging
For development and staging environments, you can go further and implement complete cluster shutdown during off-hours using node group scaling.
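A sketch of that shutdown on EKS; the cluster and node group names are placeholders, and a mirror-image job scales things back up in the morning:

# Scale a staging node group to zero overnight
aws eks update-nodegroup-config \
  --cluster-name staging \
  --nodegroup-name workers \
  --scaling-config minSize=0,maxSize=5,desiredSize=0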
Cost Allocation and Chargeback
Make costs visible to the teams generating them:
# Label everything consistently
metadata:
labels:
team: payments
product: checkout
environment: production
cost-center: "eng-123"
Kubecost and cloud provider cost allocation tools use these labels to generate per-team reports. When teams see their infrastructure costs, behavior changes. The payments team that's running 20 replicas of a service at 3% utilization will right-size it when the cost is attributed to their budget.
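With consistent labels in place, per-team numbers can be pulled straight from Kubecost's allocation API (assuming the port-forward from earlier; exact parameters may vary across Kubecost versions):

# 7-day cost allocation aggregated by the team label
curl -s "http://localhost:9090/model/allocation?window=7d&aggregate=label:team"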
Summary: Priority Order for Cost Reduction
Attack in this order for maximum ROI:
1. Right-size resource requests (free, immediate, 20-40% savings)
2. Implement HPA to match capacity to demand (reduces avg replicas)
3. Use spot instances for eligible workloads (60-80% node cost reduction)
4. Implement Cluster Autoscaler (removes idle nodes automatically)
5. Use ARM instances where possible (20-40% cheaper)
6. Implement namespace quotas (prevents waste from misconfigurations)
7. Schedule scaling for predictable off-peak periods
8. Enable cost allocation labels for accountability
Most teams that go through this process systematically find 40-60% savings without any reduction in reliability — often with improved reliability, because right-sized workloads behave more predictably.
Building something that needs to scale? We help teams architect systems that grow with their business. scopeforged.com