How to optimise Kubernetes costs

Kubernetes has become the default compute platform for modern applications — and for most enterprises, the most expensive and least-understood line in the cloud bill. A cluster that looks healthy in utilisation graphs can still be 30–60% overprovisioned, spending on capacity that pods never actually use, and the standard cloud-cost dashboards usually can't tell you why.

This guide covers why Kubernetes cost is structurally different from the rest of the cloud bill, the discovery and measurement fundamentals, rightsizing levers that actually work, waste patterns to hunt down, how to allocate cost fairly across teams sharing a cluster, and where platform tooling helps versus where it gets in the way.

Quick answer

Kubernetes cost optimisation is the practice of matching the capacity a cluster consumes to the capacity its workloads actually need. It depends on three things: accurate allocation (knowing which team / service / environment is driving the spend), rightsizing across multiple dimensions at once (pod requests, node sizes, node pools, autoscaler settings), and governance of waste patterns that only appear on container platforms (overprovisioned requests, zombie pods, orphaned volumes, untuned Horizontal Pod Autoscalers). Savings of 30–50% on container compute are typical for organisations doing this seriously for the first time.

Why Kubernetes cost is different

Elsewhere in the cloud, the billing unit more or less matches the workload: an EC2 instance, a Cloud SQL database, an S3 bucket. In Kubernetes, the billing unit is the node (plus storage, networking, managed-cluster fees) while the workload unit is the pod — and the relationship between them is mediated by the scheduler, autoscalers, and the pod requests you set. That indirection is where Kubernetes cost problems hide:

  • You pay for nodes, but you build for pods. A cluster with 20% CPU utilisation is not the same as a cluster with 20% waste — the autoscaler only removes nodes when pods can be rescheduled, and pod requests (not usage) drive that decision.

  • Request-vs-usage is the biggest lever. Pods are placed based on resources.requests. If requests are inflated over actual usage — which is the default pattern, because nobody gets paged for overprovisioning — you pay for capacity that pods reserve but don't use.

  • Autoscalers amplify whatever you feed them. The Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA) and Cluster Autoscaler optimise against the signal you give them. Wrong metric, wrong behaviour — expensively, at scale.

  • Shared cost is structurally hard. In a multi-tenant cluster, node cost is shared across many namespaces, which creates an allocation problem that naive cloud-cost tools can't solve.

  • Container platforms have their own zombie species. Leaked PVCs, unused load balancers, forgotten namespaces, orphan ingresses, dangling container images in registries. None of these appear in standard cloud-cost dashboards.

Any K8s cost programme that ignores these specifics is leaving most of the savings on the table.

Step 1 — Cost allocation: who owns the spend?

Before optimisation, allocation. You cannot reliably reduce what you cannot attribute.

The data model that works:

  • Cluster cost = sum of node cost + managed-cluster fees + persistent-volume storage + load balancers + egress + registry storage

  • Allocated to namespace / workload using: pod-hours × pod-request-share × node-cost-per-hour, plus attributed storage and networking

  • Aggregated upward into team, service, environment, business unit — driven off label / annotation conventions

If your cluster doesn't have consistent cost-attribution labels (team, service, environment, cost-centre), fix that first. Everything else downstream is guessing.
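The allocation model above reduces to a small amount of arithmetic. The sketch below is a minimal, CPU-only illustration; the field names (`namespace`, `node`, `cpu_request`, `hours`) are assumptions, and a production model would also weight memory requests and attribute storage and network separately.

```python
from collections import defaultdict

def allocate_namespace_cost(pods, node_cost_per_hour, window_hours):
    """Split shared node cost across namespaces by request share.

    pods: dicts with namespace, node, cpu_request (cores) and hours
    run in the billing window. Each pod's share of a node's cost is
    its requested core-hours divided by all requested core-hours on
    that node. Illustrative, CPU-only sketch.
    """
    # Total requested core-hours per node, the denominator of each share.
    node_core_hours = defaultdict(float)
    for p in pods:
        node_core_hours[p["node"]] += p["cpu_request"] * p["hours"]

    cost = defaultdict(float)
    for p in pods:
        node_cost = node_cost_per_hour[p["node"]] * window_hours
        share = (p["cpu_request"] * p["hours"]) / node_core_hours[p["node"]]
        cost[p["namespace"]] += share * node_cost
    return dict(cost)
```

Note that the share is driven by requests, not usage — the same property that makes overprovisioned requests expensive makes them the fair basis for allocation.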

The FOCUS billing specification (FinOps Open Cost and Usage Specification) is increasingly the target format — it standardises cost, usage and allocation attributes across providers so a Kubernetes allocation model can plug into the same reporting layer as native cloud spend.

Step 2 — Discovery: what's actually running?

A typical enterprise runs multiple clusters across AWS EKS, Azure AKS, Google GKE, on-prem Rancher / OpenShift, and edge deployments. Cost optimisation starts with knowing:

  • Clusters: how many, where, who runs them, what version, what node pools

  • Workloads: deployments, statefulsets, daemonsets, jobs, cronjobs, per namespace

  • Resources: CPU and memory requested vs actually used over time (the single most important pair of numbers)

  • Storage: persistent volumes, persistent volume claims, storage classes, retention, backups

  • Network: load balancers, ingresses, east-west traffic, egress

  • Registries: container image storage and transfer

This inventory is the base layer every other optimisation relies on.

Step 3 — Rightsize, at multiple levels

Kubernetes rightsizing is not one lever but five, operating at different granularities:

Pod-level (requests and limits)

The biggest saving lever. Pod resources.requests should reflect the 95th-percentile steady-state usage over a representative observation window (14–30 days typical). Common failure modes:

  • Copied from memory — someone set requests once at deploy time and they've never been re-examined

  • CPU-only — memory requests often ignored, leading to either OOM kills or over-sizing

  • Requests = limits — the Guaranteed QoS class means the pod can never burst above its request; a deliberate pattern for latency-sensitive workloads, not a sensible default

  • No limits — pods free-run, breaking noisy-neighbour isolation

  • Java / JVM workloads whose heap settings ignore container memory limits, causing OOM kills

The Vertical Pod Autoscaler (VPA) in recommendation mode is the right tool for generating baseline right-size values. Apply the recommendations on a cadence, not instantly.
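The p95-plus-headroom arithmetic can be sketched as a nearest-rank percentile over raw usage samples. Treat this as illustrative: VPA's actual estimator uses a decaying histogram rather than a raw percentile, and the 15% headroom default here simply lands inside the 1.1–1.3 request-vs-usage target band.

```python
import math

def recommend_request(usage_samples, headroom=1.15):
    """Suggest a CPU or memory request from observed usage.

    usage_samples: per-interval usage over a representative window
    (14-30 days). Returns the nearest-rank p95 of the samples plus
    ~15% headroom. Illustrative sketch, not VPA's estimator.
    """
    ordered = sorted(usage_samples)
    # Nearest-rank p95: smallest value >= 95% of samples.
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank] * headroom
```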

Replica count (HPA)

The Horizontal Pod Autoscaler adjusts replica count against a metric (CPU utilisation, memory, custom, external). Common failures:

  • Scaling on CPU when the bottleneck is I/O or external dependency — wrong metric, permanent over-scaling

  • Minimum replicas set too high — fixed minimum floor that the autoscaler never crosses

  • No scale-to-zero on non-prod — dev / staging workloads running 24×7 at full capacity; native HPA won't go below one replica, so this needs KEDA or scheduled scale-down

KEDA (Kubernetes Event-Driven Autoscaling) lets you scale on queue depth, event rate, and a much wider metric set than native HPA — often the right tool for async / event-driven workloads.
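The core HPA scaling rule is simple enough to state directly. The snippet below is a simplified version (the real controller adds stabilisation windows, a tolerance band and readiness handling) and shows why an inflated minReplicas becomes a permanent cost floor:

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10):
    """Simplified HPA scaling rule:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to [min_replicas, max_replicas]."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas at 90% utilisation against a 60% target, the rule scales to 6; but if the metric drops to 5% and minReplicas is 3, the fleet never shrinks below 3 regardless of load — the "minimum replicas set too high" failure above.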

Node / node pool shape

Node selection has to match workload profile:

  • Instance family — CPU-bound vs memory-bound vs balanced vs GPU

  • Spot / preemptible / savings-plan — for fault-tolerant and batch workloads, huge savings

  • Node pool separation — prod / non-prod / spot-tolerant / GPU / data separate, with appropriate tolerations and taints

  • Graviton / ARM on AWS and equivalents elsewhere — better price-performance for many workloads, requires container rebuilds

Cluster Autoscaler and Karpenter

The Cluster Autoscaler adds/removes nodes to fit pod requests. Karpenter (open-source, originated at AWS) is the newer pattern — just-in-time provisioning of right-sized nodes per pod profile, usually outperforming the classic Cluster Autoscaler significantly on cost.

Commit discipline

Like all cloud compute, commit coverage (Reserved Instances, Savings Plans, Committed Use Discounts) lowers the unit cost. Rightsize first, commit second — don't lock in capacity you're going to rightsize away.
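"Rightsize first, commit second" can be made concrete: commit to the low percentile of the post-rightsizing baseline, not today's footprint. The sketch below is an illustrative heuristic, not a provider's recommendation engine; `rightsizing_factor` and `percentile` are assumed knobs.

```python
def safe_commit_level(hourly_usage, rightsizing_factor=0.7, percentile=0.10):
    """Compute a commit level that stays fully utilised after rightsizing.

    hourly_usage: observed compute units per hour over the window.
    rightsizing_factor: expected footprint after rightsizing
    (0.7 if you expect to cut ~30%). Committing at a low percentile
    of the adjusted series keeps the commitment below the floor of
    actual demand. Illustrative heuristic.
    """
    adjusted = sorted(u * rightsizing_factor for u in hourly_usage)
    idx = int(percentile * (len(adjusted) - 1))
    return adjusted[idx]
```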

Step 4 — Hunt the waste patterns

Kubernetes produces several species of waste that don't appear in native cloud-cost dashboards:

  • Overprovisioned requests — pods reserve more CPU / memory than they ever use. Find it via: request-vs-usage ratio per pod (VPA recommendations)

  • Zombie pods — deployments running, no meaningful traffic. Find it via: zero-RPS / zero-QPS over 14+ days; zero DB connections; zero log activity

  • Forgotten namespaces — non-prod environments kept running after the project ends. Find it via: ownership / last-activity audit per namespace

  • Orphan PVCs and snapshots — persistent volumes and snapshots not referenced by any pod. Find it via: PVC-to-pod mapping; snapshot age vs retention policy

  • Unused load balancers — Services of type LoadBalancer pointing at no endpoints. Find it via: endpoint count per LB

  • Oversized node pools — minimum node count set above what scheduling requires. Find it via: node-utilisation floor vs min size

  • Stale container images in registries — old builds never garbage-collected. Find it via: image age + last-pull time per tag

  • Dangling ingresses — Ingress objects without backing services. Find it via: ingress-to-service resolution

  • Non-prod running 24×7 — dev / staging clusters at prod scale on weekends. Find it via: schedule-vs-usage on non-prod namespaces

  • Underused GPU nodes — GPU instances under 50% utilisation. Find it via: GPU-minute-utilisation metrics

Each of these can be 2–10% of cluster cost individually; together they routinely add up to 15–25%.
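Several of these checks reduce to set arithmetic once the inventory exists. Orphan-PVC detection, for example, is a join between claims and the pods that mount them — field names below are illustrative; the real data comes from the Kubernetes API (PVC lists and pod volume specs).

```python
def find_orphan_pvcs(pvcs, pods):
    """Flag PVCs not referenced by any pod.

    pvcs: list of (namespace, pvc_name) tuples from the cluster inventory.
    pods: dicts with 'namespace' and 'pvc_names' (the claims each pod
    mounts). Returns the claims nothing references. Illustrative sketch.
    """
    referenced = {(pod["namespace"], claim)
                  for pod in pods for claim in pod["pvc_names"]}
    return [pvc for pvc in pvcs if pvc not in referenced]
```

The same join pattern covers dangling ingresses (ingress-to-service) and unused load balancers (service-to-endpoints).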

Step 5 — Non-prod schedules and scale-to-zero

For non-production clusters and namespaces:

  • Scheduled shutdown — scale deployments to zero overnight and at weekends

  • On-demand dev environments — spin up per developer / per PR, tear down on inactivity

  • Workload-level scale-to-zero — Knative, KEDA-driven, OpenFaaS / Cloud Run equivalents for infrequent workloads

  • Stale-environment reaping — automated tear-down of dev environments with no activity over N days

Non-prod is routinely 20–40% of cluster cost. Half of that is reclaimable with schedules.
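The scheduled-shutdown decision itself is trivial; the operational work is wiring it to the scale subresource via a controller or CronJob. A minimal sketch, with assumed working hours:

```python
from datetime import datetime

def nonprod_target_replicas(now, normal_replicas, work_start=8, work_end=19):
    """Replica count a non-prod deployment should run at right now.

    Scale to zero outside working hours (Mon-Fri, work_start to
    work_end) and at weekends. Hours are illustrative defaults; a
    controller would apply the result via the scale subresource.
    """
    if now.weekday() >= 5:               # Saturday or Sunday
        return 0
    if work_start <= now.hour < work_end:
        return normal_replicas
    return 0
```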

Step 6 — Chargeback / showback across teams

Once allocation is accurate, the reporting tier turns the numbers into behaviour change:

  • Showback — each team sees its consumption (often first-pass; low friction)

  • Chargeback — each team carries its cost (high accountability; needs allocation to be provably fair)

  • Shared-cost allocation — platform / infra / ingress / registry costs distributed on a documented rule (proportional, even split, tiered)

Publish monthly, drill-able to pod and PVC. Teams rightsize themselves once they see the number attached to their name.
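The documented shared-cost rule can be as small as the sketch below (proportional and even-split shown; a tiered rule would add fixed shares per team). What matters is that the rule is published, not which one you pick.

```python
def distribute_shared_cost(shared_cost, team_direct_cost, rule="proportional"):
    """Split shared platform cost (ingress, registry, control plane)
    across teams under a documented rule.

    team_direct_cost: dict of team -> directly allocated spend.
    rule: 'proportional' (by direct spend) or 'even'. Illustrative.
    """
    teams = list(team_direct_cost)
    if rule == "even":
        return {t: shared_cost / len(teams) for t in teams}
    total = sum(team_direct_cost.values())
    return {t: shared_cost * team_direct_cost[t] / total for t in teams}
```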

Platform tooling — what each tool actually does

Different tools solve different pieces of the Kubernetes cost problem. A realistic stack usually combines several:

  • Native Kubernetes metrics + Prometheus / OpenTelemetry — baseline CPU/memory, pod-level observability, the foundation every cost tool depends on

  • VPA (Vertical Pod Autoscaler) — pod-level request recommendations

  • HPA + KEDA — replica autoscaling on in-cluster and event-driven metrics

  • Cluster Autoscaler / Karpenter — node-level autoscaling

  • OpenCost / Kubecost — open-source / commercial Kubernetes cost allocation, widely adopted

  • Cloud-native (AWS Cost Allocation Tags, Azure Cost Management, GCP Cost Allocation) — attribute cluster cost into native cloud bills

  • FinOps platforms (CerteroX Cloud Management, Vantage, Finout, IBM Apptio Cloudability, CloudHealth) — consolidate K8s with non-K8s cloud cost, run commit strategies, allocate across teams

CerteroX Cloud Management for Kubernetes

CerteroX Cloud Management normalises AWS, Azure, Google Cloud, Oracle Cloud and Kubernetes cost data against the FOCUS billing specification, giving a single allocation and reporting model across container and non-container spend. Key points of fit for K8s cost optimisation:

  • Aggregates Kubernetes cluster cost alongside native cloud cost so comparisons are apples-to-apples

  • FinOps Inform / Optimize / Operate workflow runs across both container and non-container spend

  • FinOps Certified Platform credential

  • 38% verified cloud savings across customer deployments

Certero sits in the FinOps platform tier — above raw metrics and cost allocation (which typically come from native cloud data and OpenCost / Kubecost) and optimised for cross-cloud reporting, commit strategy, and allocation.

FinOps framework mapping

K8s cost optimisation maps cleanly onto the FinOps Foundation phase model:

  • Inform — discovery, allocation, tagging hygiene, request-vs-usage visibility

  • Optimize — rightsizing, commit discipline, scale-to-zero, waste cleanup, node shape, autoscaler tuning

  • Operate — automation (VPA / HPA / Karpenter / schedules), policy (guardrails on requests and limits), chargeback, culture

Mature K8s FinOps runs all three phases continuously, not as a one-off project.

Metrics that matter

  • Request-vs-usage ratio — headline efficiency metric; target 1.1–1.3 (10–30% overhead above actual p95)

  • Node utilisation — average and p95, after autoscaler settles; target 60–80% on prod, scale-to-zero on non-prod

  • Cost per service / per team / per environment — allocation integrity

  • Cost per unit of work (requests served, jobs processed, transactions) — the right economic denominator

  • Waste categories as a share of cluster cost — direction of travel per category month-over-month

  • Commit coverage — % of baseline compute covered by RIs / SPs / CUDs

  • Scale-to-zero adoption — % of non-prod clusters with schedules or on-demand lifecycle
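The headline request-vs-usage metric can be computed per workload, with the top of the 1.1–1.3 target band as the flag threshold. Units must match between requested and used capacity (cores or bytes); the workload names and tuple layout below are illustrative.

```python
def flag_overprovisioned(workloads, target=1.3):
    """Flag workloads whose request-vs-usage ratio exceeds the target.

    workloads: dict of name -> (requested, p95_used) in the same unit.
    Returns {name: ratio} for workloads above the target band.
    Illustrative sketch of the headline efficiency metric.
    """
    return {name: req / used
            for name, (req, used) in workloads.items()
            if used > 0 and req / used > target}
```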

Common pitfalls

  • Rightsizing CPU only, ignoring memory — memory over-provisioning is usually bigger than CPU

  • Requests set once at deploy, never revisited — silent drift builds up waste month over month

  • Committing before rightsizing — locks in capacity you're about to remove

  • Allocating by pod count, not by requests × runtime — unfair and inaccurate chargeback

  • No scale-to-zero on non-prod — 30–50% of non-prod cost avoidable

  • Single node pool for everything — no room for spot / GPU / memory-optimised workloads

  • HPA on the wrong metric — permanent over-scaling that looks like "autoscaling working"

  • Ignoring orphan PVCs and load balancers — compounding storage / network spend

  • No FOCUS-aligned reporting — can't compare K8s cost cleanly against non-K8s spend

  • Optimising without allocation — teams have no incentive to help; the programme stalls

About Certero

Certero delivers an enterprise-grade product family covering IT asset, software, SaaS, cloud, datacenter and AI management through CerteroX ITAM, CerteroX SAM, CerteroX SaaS Management, CerteroX Cloud Management, CerteroX Datacenter Management and CerteroX AI Management.

For Kubernetes and multi-cloud cost optimisation, CerteroX Cloud Management covers AWS, Azure, Google Cloud, Oracle Cloud and Kubernetes against the FOCUS billing specification, with allocation, rightsizing recommendations, and commit-strategy modelling in a single view.

Certero is a FinOps Certified Platform and an Oracle Certified Partner, holds a 97% "would recommend" rating, has been recognised 4 times as Gartner Customers' Choice, and customers have realised a verified 38% average saving on cloud spend.

FAQs

Why is Kubernetes cost harder to manage than other cloud cost?

Because the billing unit (the node) and the workload unit (the pod) are different things, linked by the scheduler and autoscalers. A cluster with low pod utilisation can still be running on the right number of nodes if pod requests are inflated — the native cloud bill shows the node cost, not the request-vs-usage gap. Allocation across shared nodes also has to be solved separately for K8s; standard cloud-cost dashboards can't do it.

What's the single biggest Kubernetes cost lever?

Pod rightsizing — specifically, making resources.requests match the 95th-percentile actual usage over a representative window. Most containers are overprovisioned in requests by 30–100%, and those requests drive node count through the autoscaler. Rightsizing requests shrinks the cluster proportionally.

How do I choose the right observation window for rightsizing?

14 days minimum for most workloads, 30 days for workloads with monthly patterns (payroll, month-end batches, retail seasonality). Use p95 or p99 of the window as the target, not the average — you want capacity to absorb peaks without paging. For very spiky workloads, combine a reasonable steady-state p95 request with an HPA scaled on the right metric for burst.

Should I use VPA in auto mode?

Usually not in production. VPA in recommendation mode generates right-size values; apply them as part of the normal deployment cycle (CI-driven, reviewable, rollback-able). VPA in full auto mode restarts pods whenever it wants to, which breaks long-lived workloads and creates operational noise. Auto mode is fine for very well-contained batch workloads; everything else, use recommendations.

What's the difference between HPA, VPA, and Cluster Autoscaler?

HPA scales replica count (how many pods) based on a metric. VPA scales pod-level resources (how big each pod is). Cluster Autoscaler (and Karpenter on AWS) scales node count / node types to fit the pods. All three can operate at the same time, and they need to be tuned together — HPA scaling up while VPA is trying to right-size is a common misconfiguration.

How does Karpenter change the picture?

Karpenter replaces or augments classic Cluster Autoscaler on AWS by provisioning nodes just-in-time that are sized to the specific workload profile waiting to be scheduled. It typically outperforms Cluster Autoscaler meaningfully on cost by picking better instance shapes and consolidating pods aggressively. Equivalents exist in other ecosystems but Karpenter is the cleanest current reference.

What's the right way to allocate shared cluster cost?

The minimum-viable model: pod-hours × pod-request-share × node-cost-per-hour, aggregated by label (team, service, environment). Add storage attribution on PVC, network attribution on ingress / load balancer, and a documented rule for genuinely shared cost (infra namespace, registries, control plane). Publish the allocation method so teams can challenge it — a contested allocation that everyone understands is more useful than a "perfect" one nobody trusts.

Showback or chargeback — which is better?

Start with showback. Chargeback requires allocation to be provably fair — otherwise it becomes a source of political friction that outweighs the cost-discipline benefit. Once showback has been running long enough that teams trust the numbers, chargeback or a budget-based model becomes viable. For many enterprises, showback with team budgets (self-service rightsizing encouraged, overage visible) is the right permanent state.

How much can I typically save on Kubernetes cost?

Organisations running the full programme for the first time routinely find 30–50% savings — headline drivers are pod rightsizing (15–25%), non-prod scale-to-zero and scheduling (10–20%), and waste cleanup (5–10%). Mature programmes sustain 5–10% year-over-year efficiency gains on top as scale grows.

Should I use OpenCost, Kubecost, or a FinOps platform?

They solve overlapping but different problems. OpenCost is the open-source reference for Kubernetes cost allocation — accurate, free, self-hosted, focused on K8s. Kubecost is the commercial product built on OpenCost with added features. A FinOps platform (CerteroX Cloud Management, Apptio Cloudability, Vantage, Finout, etc.) combines K8s cost with the rest of the cloud bill in a single allocation / commit / forecasting model. Enterprises running significant cloud spend alongside K8s usually end up with both — in-cluster allocation from OpenCost / Kubecost feeding up into the FinOps platform for cross-cloud reporting.

What about GPU cost?

GPU nodes are the highest unit-cost resource in most clusters and the most-often underused. Track GPU utilisation per pod (not just node), adopt node pools dedicated to GPU workloads so allocation is visible, adopt MIG (multi-instance GPU) where supported for smaller workloads sharing a GPU, and aggressively rightsize. Reserved / committed GPU capacity is available from all major providers and can materially change the unit cost for steady workloads.

How does Kubernetes cost optimisation fit into FinOps?

K8s is a subdomain of cloud FinOps — same Inform / Optimize / Operate phases, same accountability patterns, same commit-and-rightsize disciplines. The main differences are the mechanics (pod-level rightsizing, VPA / HPA / Karpenter, namespace / label-based allocation) and the extra waste categories (zombie pods, orphan PVCs, overprovisioned requests, forgotten namespaces). A FinOps programme that doesn't explicitly cover K8s has a blind spot proportional to how much of the cloud bill K8s represents.

Can I chargeback K8s cost to non-K8s teams?

Yes, if the allocation holds up. Platform teams that run the cluster can pass cost through to the service teams running workloads, using pod-hours × requests × node-cost as the unit. The complication is shared cost (cluster control plane, registry, shared ingress) — pick a transparent rule (proportional by workload, even split across teams, tiered) and publish it. Chargeback where allocation is disputed is worse than showback.

How often should I re-run rightsizing?

Continuously for the VPA recommendation layer (it runs on its own cadence). Apply the recommendations on a rhythm that matches deployment frequency — weekly or biweekly for most teams, not ad-hoc. Pair with a quarterly broader sweep that looks at node shape, autoscaler config, commit coverage and waste categories.

What's the relationship between Kubernetes cost and FOCUS?

The FOCUS billing specification gives a common cost / usage schema across cloud providers; Kubernetes cost allocation (from OpenCost, Kubecost, or a FinOps platform) can be expressed in the same schema, letting K8s and non-K8s cost be reported side-by-side and aggregated cleanly. Adopting FOCUS makes cross-cloud and cross-platform reporting much less fragile and underpins more reliable allocation, forecasting and commit strategy.

