Kubernetes is excellent at running things. It is much less opinionated about running them cheaply or safely. Most of the operational wins I have seen came not from clever YAML but from a handful of disciplines applied consistently.
Right-size before you autoscale
Autoscaling amplifies whatever requests and limits you have set. If every pod asks for far more CPU than it uses, the cluster scales out on phantom demand and the bill follows. Before touching the Horizontal Pod Autoscaler, look at actual usage and set requests close to the real working set:
```yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    memory: 256Mi
```

A deliberate omission above: I usually leave CPU limits off and rely on requests plus the scheduler. CPU throttling from tight limits hurts tail latency far more than it saves money.
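One way to turn "look at actual usage" into a number is to take a high percentile of observed CPU samples and add some headroom. The sketch below assumes you already have usage samples in millicores from your metrics system. The function names, the p90 choice, the 20% headroom, and the 4000m node size are all illustrative assumptions, not recommendations.

```go
package main

import (
	"fmt"
	"sort"
)

// suggestCPURequest picks a CPU request in millicores from observed usage:
// the nearest-rank percentile of the samples, plus a headroom percentage,
// rounded up to the next 10m. Percentile and headroom are judgment calls.
func suggestCPURequest(samplesMilli []int, percentile, headroomPct int) int {
	if len(samplesMilli) == 0 {
		return 0
	}
	sorted := append([]int(nil), samplesMilli...) // leave the caller's slice untouched
	sort.Ints(sorted)

	rank := (len(sorted)*percentile + 99) / 100 // ceiling division, 1-based rank
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	withHeadroom := (sorted[rank-1]*(100+headroomPct) + 99) / 100
	return (withHeadroom + 9) / 10 * 10
}

// nodesNeeded estimates how many nodes the scheduler has to reserve for a
// given request size. It is deliberately simplified: CPU only, no system
// overhead, no bin-packing fragmentation.
func nodesNeeded(replicas, requestMilli, nodeAllocatableMilli int) int {
	total := replicas * requestMilli
	return (total + nodeAllocatableMilli - 1) / nodeAllocatableMilli
}

func main() {
	// Hypothetical usage samples, in millicores, with one short spike.
	usage := []int{40, 55, 60, 62, 70, 75, 80, 85, 90, 300}
	request := suggestCPURequest(usage, 90, 20)

	fmt.Printf("suggested request: %dm\n", request)
	fmt.Printf("nodes for 40 replicas at 500m: %d\n", nodesNeeded(40, 500, 4000))
	fmt.Printf("nodes for 40 replicas at %dm: %d\n", request, nodesNeeded(40, request, 4000))
}
```

The second function is the "phantom demand" argument in arithmetic form: the scheduler reserves capacity by request, not by usage, so an inflated request turns directly into extra nodes.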
Hibernate what nobody is using
Not every workload needs to run around the clock. Internal tools, batch processors, and staging environments are often idle for most of the day. Scaling those to zero on a schedule, or behind a request-triggered wake-up, is one of the highest-leverage ways to cut cost.
In one environment, combining cluster autoscaling with scheduled hibernation of idle services cut compute cost by roughly 30% with no impact on the user-facing path. The trick is being honest about which workloads are truly latency-sensitive and which are not.
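The scheduling decision itself is small. Here is a minimal sketch, assuming a controller or cron job that periodically asks "how many replicas should this workload have right now?" and applies the answer. The `window` type, the office-hours values, and the helper names are hypothetical, and actually applying the replica count to the cluster is left out.

```go
package main

import (
	"fmt"
	"time"
)

// window describes when a workload should be awake.
// Hours are interpreted in the time zone of the timestamp passed in.
type window struct {
	startHour    int // inclusive
	endHour      int // exclusive
	weekdaysOnly bool
}

// desiredReplicas returns the awake replica count inside the window
// and zero outside it.
func desiredReplicas(w window, now time.Time, awake int32) int32 {
	if w.weekdaysOnly {
		if d := now.Weekday(); d == time.Saturday || d == time.Sunday {
			return 0
		}
	}
	if h := now.Hour(); h < w.startHour || h >= w.endHour {
		return 0
	}
	return awake
}

// mustParse is a small helper for the demo timestamps below.
func mustParse(ts string) time.Time {
	t, err := time.Parse(time.RFC3339, ts)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	office := window{startHour: 8, endHour: 19, weekdaysOnly: true}
	for _, ts := range []string{
		"2024-03-04T10:00:00Z", // Monday morning
		"2024-03-04T22:00:00Z", // Monday night
		"2024-03-09T10:00:00Z", // Saturday morning
	} {
		fmt.Printf("%s -> %d replicas\n", ts, desiredReplicas(office, mustParse(ts), 2))
	}
}
```

Keeping the decision a pure function of the schedule and the clock makes it easy to test, and easy to exempt anything latency-sensitive by simply not giving it a window.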
Decouple release from deploy
Shipping a binary and turning on a feature are two different events, and conflating them makes every deploy risky. Feature flags — I have used OpenFeature for this — let you deploy continuously and roll features out gradually:
```go
// The default is false, so if the flag cannot be evaluated
// the request falls back to the existing path.
enabled := client.Boolean(ctx, "new-screening-flow", false,
	openfeature.NewEvaluationContext(userID, attrs))
if enabled {
	return s.screenV2(ctx, app)
}
return s.screenV1(ctx, app)
```

If the new path misbehaves, you flip a flag instead of rolling back a deployment. That single change does more for delivery confidence than most CI improvements.
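Gradual rollout usually comes down to deterministic bucketing: hash the user into one of 100 buckets and enable the flag for buckets below the current percentage. The sketch below illustrates that idea in isolation. It is not how OpenFeature or any particular flag provider implements targeting, and `bucket` and `inRollout` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket maps a (flag, user) pair to a stable value in [0, 100).
// Including the flag key means a user does not land in the same
// bucket for every flag.
func bucket(flagKey, userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(flagKey))
	h.Write([]byte{0}) // separator so ("ab", "c") differs from ("a", "bc")
	h.Write([]byte(userID))
	return h.Sum32() % 100
}

// inRollout reports whether the user is inside the rollout percentage.
// Raising the percentage only ever adds users; nobody flips back to off.
func inRollout(flagKey, userID string, percent uint32) bool {
	return bucket(flagKey, userID) < percent
}

func main() {
	for _, user := range []string{"alice", "bob", "carol", "dave"} {
		fmt.Printf("%-6s bucket=%2d in10%%=%v in50%%=%v\n",
			user,
			bucket("new-screening-flow", user),
			inRollout("new-screening-flow", user, 10),
			inRollout("new-screening-flow", user, 50))
	}
}
```

Because the bucket depends only on the flag key and the user ID, a given user sees a consistent experience across requests and across pods while the percentage ramps up.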
GitOps keeps the cluster honest
Declarative delivery with a tool like Argo CD means the desired state lives in Git, not in someone's terminal history. The benefits compound:
- Every change is reviewed and audited.
- Drift is detected and reconciled automatically.
- Rollback is a `git revert`, not an archaeology project.
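Conceptually, drift detection is a diff between what Git says and what the cluster reports, and reconciliation is re-applying the Git side. The toy sketch below shows that loop over flat key/value maps. It is not Argo CD's algorithm, which compares full Kubernetes manifests, and all names and values in it are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// drift lists every difference between the desired state (from Git)
// and the live state (from the cluster), in a stable order.
func drift(desired, live map[string]string) []string {
	var out []string
	for key, want := range desired {
		got, ok := live[key]
		switch {
		case !ok:
			out = append(out, "missing: "+key)
		case got != want:
			out = append(out, "changed: "+key+" (want "+want+", got "+got+")")
		}
	}
	for key := range live {
		if _, ok := desired[key]; !ok {
			out = append(out, "extra: "+key)
		}
	}
	sort.Strings(out)
	return out
}

// reconcile returns a live state that matches the desired state exactly.
func reconcile(desired map[string]string) map[string]string {
	live := make(map[string]string, len(desired))
	for key, value := range desired {
		live[key] = value
	}
	return live
}

func main() {
	desired := map[string]string{"image": "app:1.4.2", "replicas": "3"}
	live := map[string]string{"image": "app:1.4.2", "replicas": "5", "debug": "true"}

	for _, d := range drift(desired, live) {
		fmt.Println(d)
	}
	fmt.Println("drift after reconcile:", len(drift(desired, reconcile(desired))))
}
```

In this example the hand-edited replica count and the stray debug setting both surface as drift, which is exactly the kind of change that otherwise lives only in someone's terminal history.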
Reliability is mostly about limits and probes
Two unglamorous settings prevent a surprising share of incidents:
- Readiness probes that keep failing until the pod can actually serve traffic, for example while a connection pool is still warming up.
- Memory limits paired with a sensible `GOMEMLIMIT` so the Go runtime collects garbage before the kernel OOM-kills the container.
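Both settings have a small in-process counterpart. The sketch below assumes a Go service in a container with the 256Mi memory limit from the earlier manifest. It sets the runtime's soft memory limit to 90% of that (the 90% figure is my assumption, not a rule) and exposes a readiness handler that returns 503 until warm-up has finished. The `/readyz` path and the type and function names are hypothetical, and starting the HTTP server is omitted so the example exits on its own.

```go
package main

import (
	"fmt"
	"net/http"
	"runtime/debug"
	"sync/atomic"
)

// readiness reports 503 until the service marks itself ready.
type readiness struct {
	ok atomic.Bool
}

// status is the HTTP status a readiness probe would see right now.
func (r *readiness) status() int {
	if r.ok.Load() {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

func (r *readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(r.status())
}

// goMemLimit derives a soft memory limit for the Go runtime as a
// percentage of the container's memory limit, leaving room for
// non-heap memory such as stacks and cgo allocations.
func goMemLimit(containerLimitBytes, percent int64) int64 {
	return containerLimitBytes * percent / 100
}

func main() {
	const containerLimit = 256 << 20 // 256Mi, matching the manifest's memory limit

	// Equivalent in effect to setting the GOMEMLIMIT environment variable.
	limit := goMemLimit(containerLimit, 90)
	debug.SetMemoryLimit(limit)
	fmt.Println("soft memory limit (bytes):", limit)

	ready := &readiness{}
	mux := http.NewServeMux()
	mux.Handle("/readyz", ready)

	fmt.Println("status before warm-up:", ready.status())
	// ... warm the connection pool, load caches, and so on ...
	ready.ok.Store(true)
	fmt.Println("status after warm-up:", ready.status())

	// A real service would now call http.ListenAndServe with mux.
}
```

The soft limit makes the garbage collector work harder as the heap approaches it, which is what gives the runtime a chance to stay under the container limit instead of being killed at it.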
Production Kubernetes rewards boring, consistent defaults far more than it rewards clever ones.