FinOps

Three cloud cost anti-patterns that survive every FinOps review

The line items that keep surviving cost-optimization passes, why they hide in plain sight, and what to change in the next billing cycle.

Taha Zubair

Founder, Cloud Upload · 4 min

Most FinOps reviews chase the obvious savings — rightsizing idle EC2, committing to savings plans, deleting unattached volumes — and stop there. The three anti-patterns below outlive those sweeps because they look normal, they map to real workloads, and nobody has framed them as problems yet. They are usually where the next 15–25% lives.

Key takeaway

A FinOps review that only looks at compute is reading half the bill. The large, persistent waste in modern cloud accounts is in networking, data movement, and “temporary” architectures that became permanent.

Anti-pattern 1 — NAT gateway egress as a default

A single NAT gateway processing a terabyte a day is ~$1,400 a month in data-processing charges alone, before egress. Multiply by environments (prod, staging, dev), multiply by availability zones (NAT is per-AZ for resilience), and you have a four-to-six-figure line item that nobody owns.
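The arithmetic above can be sketched directly. The per-unit prices are assumptions (us-east-1 list rates at time of writing; check your region and bill):

```python
# Rough NAT gateway monthly cost model.
# Rates are assumptions: us-east-1 list prices, verify against your own bill.
PROCESSING_PER_GB = 0.045  # $/GB of data processed by the gateway
HOURLY = 0.045             # $/gateway-hour, charged whether traffic flows or not

def nat_monthly_cost(gb_per_day, days=30):
    """(processing, hourly) charges for one NAT gateway over a month."""
    processing = gb_per_day * days * PROCESSING_PER_GB
    hourly = 24 * days * HOURLY
    return processing, hourly

proc, hrly = nat_monthly_cost(1024)  # one gateway pushing 1 TB/day
fleet = 3 * 3                        # 3 environments x 3 AZs, per the text
print(f"per gateway: ${proc:,.0f}/mo processing + ${hrly:.0f}/mo hourly")
print(f"fleet of {fleet} at a tenth of that traffic: ${fleet * (proc / 10 + hrly):,.0f}/mo")
```

Even idle gateways carry the hourly charge, which is why the per-AZ multiplication matters.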

The underlying cause is that NAT was the path of least resistance at day one. Every private-subnet workload needs to reach something public — ECR, S3, a vendor API — and NAT gets it there without design work. It stays because ripping it out means replacing each outbound path with a VPC endpoint, a PrivateLink connection, or a routing change, and that work is diffuse across teams.

What to look at:

  • VPC endpoints for AWS services take the traffic off NAT entirely. Gateway endpoints for S3 and DynamoDB are free; interface endpoints for ECR, Secrets Manager, SSM, and KMS carry a small hourly and per-GB charge that is still far below the NAT processing rate. These are the ones that add up.
  • PrivateLink for vendor APIs (Datadog, Snowflake, etc.) when offered.
  • Cross-AZ NAT consolidation — if you don’t need per-AZ redundancy for the egress path, one NAT in the lowest-traffic AZ is cheaper than three.

A thorough pass typically collapses NAT spend by 60–80%.
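A sketch of the endpoint-versus-NAT comparison, per service. The traffic figures are hypothetical and the rates are assumptions (NAT processing $0.045/GB; interface endpoints roughly $0.01/GB plus $0.01 per AZ-hour; gateway endpoints free):

```python
# Estimate monthly savings from moving a service's NAT traffic to a VPC endpoint.
# All rates are assumed list prices -- substitute your region's actual figures.
NAT_GB = 0.045                       # NAT data-processing, $/GB
IFACE_GB, IFACE_HOUR = 0.01, 0.01    # interface endpoint: $/GB and $/AZ-hour
GATEWAY_SERVICES = {"s3", "dynamodb"}  # gateway endpoints have no charge

def endpoint_saving(service, gb_month, azs=3, hours=720):
    nat_cost = gb_month * NAT_GB
    if service in GATEWAY_SERVICES:
        return nat_cost              # full NAT charge avoided, endpoint is free
    endpoint_cost = gb_month * IFACE_GB + azs * hours * IFACE_HOUR
    return nat_cost - endpoint_cost

traffic = {"s3": 8000, "ecr": 1200, "secretsmanager": 40}  # hypothetical GB/month
for svc, gb in traffic.items():
    print(f"{svc:>15}: ${endpoint_saving(svc, gb):,.2f}/mo saved")
```

Note the low-traffic case can come out negative: an interface endpoint's hourly charge exceeds the NAT savings for a service pushing a few GB a month, which is exactly why only the high-volume services are worth the change.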

Anti-pattern 2 — Cross-AZ chatter between microservices

The AWS bill does not line-item “your microservices are talking to each other across availability zones.” It quietly charges $0.01/GB each way and calls it “data transfer — inter-AZ.” In any non-trivial service mesh, this is a recurring five-figure surprise.

Common miss

The same cross-AZ traffic shows up twice on the bill — once on the sender, once on the receiver. It is the only data-transfer category that charges both ends. Quick sanity check: divide total inter-AZ by two before you rationalize it.
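The sanity check as arithmetic, with a hypothetical bill line:

```python
# Inter-AZ transfer bills both the sending and receiving side at $0.01/GB
# (assumed rate), so the billed GB is double the bytes that actually moved.
RATE = 0.01  # $/GB each way

billed_gb = 50_000                  # hypothetical monthly "inter-AZ" line item
actual_gb_moved = billed_gb / 2     # halve before reasoning about the traffic
monthly_charge = billed_gb * RATE
print(f"{actual_gb_moved:,.0f} GB moved -> ${monthly_charge:,.0f}/mo "
      f"(effective ${monthly_charge / actual_gb_moved:.2f}/GB)")
```

The effective rate on data actually moved is $0.02/GB, which is the number to use when comparing against a placement fix.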

The pattern that produces this is almost always a pod-to-pod or service-to-service call path that was never placement-aware. Kubernetes will happily schedule caller and callee in different zones. An RDS read replica sitting in a different zone from the application it serves pays the inter-AZ rate on every query, around the clock.

Fixes:

  • Zone-aware service routing (topology-aware hints in Kubernetes, latency-based routing in a service mesh).
  • Placement groups or zone affinity for chatty caller-callee pairs.
  • Single-AZ non-production environments. There is no compliance argument for multi-AZ staging.
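Before applying any of these fixes, it helps to rank which caller-callee pairs are actually chatty across zones. A minimal sketch, assuming flow records of the form (caller, caller AZ, callee, callee AZ, GB/month) — in practice derived from VPC Flow Logs joined against ENI/AZ metadata:

```python
from collections import defaultdict

# Hypothetical flow records: (caller, caller_az, callee, callee_az, gb_per_month).
flows = [
    ("checkout", "1a", "pricing",  "1b", 900),
    ("checkout", "1a", "pricing",  "1a", 300),
    ("search",   "1b", "index",    "1c", 2200),
    ("web",      "1a", "checkout", "1a", 5000),
]

cross_az = defaultdict(float)
for caller, caller_az, callee, callee_az, gb in flows:
    if caller_az != callee_az:               # only cross-AZ hops are billed
        cross_az[(caller, callee)] += gb

# Rank the chatty pairs: top entries are candidates for zone affinity
# or topology-aware routing.
for (caller, callee), gb in sorted(cross_az.items(), key=lambda kv: -kv[1]):
    print(f"{caller} -> {callee}: {gb:,.0f} GB/mo cross-AZ "
          f"(~${gb * 2 * 0.01:,.0f}/mo billed across both ends)")
```

The same-zone flows drop out of the ranking entirely, which is the point: the fix list above only needs to touch the pairs that surface here.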

Anti-pattern 3 — Production-sized non-production

On paper, staging is a scaled-down copy of production. In practice, staging is whatever production was two years ago, copied once and never shrunk. The instance types are the same. The RDS replica count is the same. The ElastiCache cluster is the same. The load is one-twentieth.

This is the anti-pattern that survives the most reviews because the cost of a mistake — staging breaking during a deploy rehearsal — is visible, and the cost of the waste is not. Nobody gets paged for paying $8,000 a month to run an oversized staging cluster. Someone gets paged if staging goes down during the release window.

The fix is boring and effective:

  • Resize non-production to match its actual load, not the production load it notionally mirrors.
  • Shut down entire non-production accounts overnight and on weekends. Stopping weekends alone takes a 168-hour week down to 120 running hours, a near-30% saving; adding overnight stops pushes it well past half.
  • Promote a staging environment to production-grade only when it is actively testing load. The rest of the time it is a functional-test environment, priced accordingly.
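The scheduling arithmetic from the second bullet, as a one-function sketch:

```python
HOURS_PER_WEEK = 168

def scheduled_saving(run_hours_per_week):
    """Fraction of always-on cost saved by stopping outside the run window."""
    return 1 - run_hours_per_week / HOURS_PER_WEEK

weekdays_only = scheduled_saving(5 * 24)    # weekends off: 120h running
business_hours = scheduled_saving(5 * 12)   # 12h weekdays only: 60h running
print(f"weekends off: {weekdays_only:.0%}, nights and weekends off: {business_hours:.0%}")
```

Weekends alone get close to 30%; a 12-hour weekday schedule saves roughly two-thirds of the always-on cost, which is why the scheduling fix tends to dwarf the rightsizing one.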

Why these survive the first pass

All three anti-patterns share a property: the wasteful state is indistinguishable from the correct state at a glance. A NAT gateway is correct. Cross-AZ traffic is correct. A staging cluster is correct. Nothing in the AWS console or Cost Explorer waves a flag. The flags come from asking one level deeper — what’s going through the NAT, where the endpoints of the cross-AZ flows are, how loaded the staging cluster actually is.

That second-level read is the work. A competent FinOps engagement spends most of its time there, not in Cost Explorer. The savings are worth the time by a large margin — 15–25% of the bill is typical on a first pass, with the three anti-patterns above accounting for most of it.

What to change this quarter

  • Export a monthly view of NAT data processing by VPC. Anything over 500 GB/month is a VPC-endpoint candidate.
  • Split data-transfer charges by source/destination AZ. Anything with a large inter-AZ component is a placement review.
  • Right-size and schedule non-production environments. The tooling is mature; the blocker is policy.
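The first checklist item reduces to a filter over the monthly export. A trivial sketch with hypothetical per-VPC figures (in practice pulled from the Cost and Usage Report or Cost Explorer):

```python
THRESHOLD_GB = 500  # monthly NAT processing above this is worth an endpoint review

# Hypothetical NAT-processed GB per VPC for the month.
nat_by_vpc = {"vpc-prod": 31_000, "vpc-staging": 4_200, "vpc-dev": 180}

candidates = {vpc: gb for vpc, gb in nat_by_vpc.items() if gb > THRESHOLD_GB}
print("VPC-endpoint candidates:", ", ".join(sorted(candidates)))
```

Everything under the threshold stays on NAT; the point of the 500 GB line is to keep the endpoint work focused on the VPCs where it pays back.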

None of this is surprising. It is just rarely done, and the bill notices.


Last updated December 15, 2025