A Series B SaaS spending $25,000 a month on AWS will routinely spend $40,000 a month on Datadog, or Splunk, or some combination of logging, APM, and infrastructure monitoring vendors. This is not an aberration. It is the default end state of an observability program that grew the way most observability programs grow: one integration at a time, every signal enabled, retention set to “in case we need it.”
Key takeaway
The fastest-growing line item in most SaaS P&Ls right now is observability. It is not growing because the infrastructure is growing — it is growing because nobody is paid to turn signals off.
How the bill got that large
Four mechanisms, in roughly the order of damage:
1. Custom metric cardinality
A single metric with five tag keys, each taking one of a hundred values, is ten billion unique time series (100^5 = 10^10). Most APM vendors price per custom metric. Nobody set out to emit ten billion time series; somebody added a user_id tag to a Prometheus histogram and the bill went up by a factor of a thousand overnight.
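The arithmetic compounds multiplicatively, which is why one tag can move the bill by orders of magnitude. A minimal sketch of the math; the tag counts are illustrative, not from any real deployment:

```python
from math import prod

def series_count(tag_cardinalities):
    """Unique time series a single metric can emit: the product of
    the distinct-value counts across its tag keys."""
    return prod(tag_cardinalities)

# Five tag keys, each taking one of a hundred values:
print(series_count([100] * 5))           # 10000000000 (ten billion)

# Adding one user_id tag with 1,000 active users multiplies the
# series count, and the per-metric bill, by a thousand:
print(series_count([100] * 5 + [1000]))  # 10000000000000
```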
2. Log ingestion at debug level in production
The default log level in most frameworks is INFO. Production workloads that ship at DEBUG, because somebody was debugging an incident six months ago and nobody turned it back down, can produce 50–100x the log volume of a sensibly configured deployment, at 50–100x the ingestion cost.
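One way to keep unowned DEBUG out of production is to enforce the rule at service startup. A sketch using Python's standard logging module; the LOG_LEVEL and LOG_LEVEL_OWNER environment variables are a hypothetical convention, not a standard:

```python
import logging
import os

def configure_logging():
    """Honor a DEBUG override only when someone has put their name on
    it; otherwise fall back to INFO. Returns the effective level name."""
    level_name = os.getenv("LOG_LEVEL", "INFO").upper()
    if level_name == "DEBUG" and not os.getenv("LOG_LEVEL_OWNER"):
        level_name = "INFO"  # unowned DEBUG overrides get reverted
    logging.getLogger().setLevel(getattr(logging, level_name))
    return level_name
```

The point is not the mechanism; it is that a DEBUG override without a named owner should not survive a deploy.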
3. APM trace sampling set to 100%
Distributed tracing at 100% sampling is a research tool. At production scale it is not affordable and it is not actionable — 99.9% of the traces you collect you will never look at. Vendors default to high sampling because it looks good in the demo; the team that implemented it rarely revisits the setting.
4. Retention defaults nobody reviewed
Logs retained 90 days. Metrics retained 15 months. Traces retained 30 days. Multiply by ingestion volume. Every vendor has a retention setting and every vendor defaults it generously because retention drives revenue.
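The interaction of ingestion volume and retention is worth making explicit. A back-of-envelope sketch; the 50 GB/day figure is an assumption for illustration:

```python
def retained_gb(daily_ingest_gb, retention_days):
    """Steady-state resident volume: the amount you pay storage on."""
    return daily_ingest_gb * retention_days

# The same 50 GB/day of logs at two retention settings:
print(retained_gb(50, 90))  # 4500 GB resident at the 90-day default
print(retained_gb(50, 7))   # 350 GB resident at a 7-day window
```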
The triage order
When you have a 30-day window to cut the bill without losing operational capability, the order matters.
First — audit custom metric cardinality
This is the single highest-ROI exercise in observability cost reduction. Most APM vendors expose a top-N report of custom metrics by cardinality. The worst offenders will almost always be internal metrics with a user ID, session ID, or request ID baked into a tag. The fix is to remove the high-cardinality tag and, if the dimension is actually needed, emit it as a log field instead.
Typical savings: 20–40% of the APM line item.
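If the vendor's top-N report is not granular enough, the same ranking can be reproduced from a sample of emitted data points. A sketch; the sample format here is an assumption, not any vendor's export schema:

```python
from collections import defaultdict

def top_cardinality_tags(samples, n=3):
    """Rank tag keys by distinct-value count. `samples` is an
    iterable of {tag_key: tag_value} dicts, one per data point."""
    distinct = defaultdict(set)
    for tags in samples:
        for key, value in tags.items():
            distinct[key].add(value)
    ranked = sorted(distinct.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(key, len(values)) for key, values in ranked[:n]]

# A user_id tag dominates immediately:
points = [{"service": "api", "user_id": str(i)} for i in range(5000)]
print(top_cardinality_tags(points, n=2))  # [('user_id', 5000), ('service', 1)]
```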
Second — separate error logs from access logs
Error logs are low-volume, high-value, and need full retention. Access logs are high-volume, low-value per line, and need querying for at most a week. Shipping both to the same ingestion tier at the same retention is the single most common log-cost mistake. Split them. Route access logs to S3 with Athena on top. Keep errors in the APM vendor.
Typical savings: 40–60% of the logging line item.
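The split can live in whatever log shipper you already run; the decision itself is small. A sketch of the routing rule, with destination names as placeholders ("apm" standing for the vendor's indexed tier, "s3" for the cheap object-store tier):

```python
def route(record):
    """Pick a destination tier for a parsed log record."""
    if record.get("level") in ("ERROR", "CRITICAL"):
        return "apm"  # low volume, high value, full retention
    if record.get("type") == "access":
        return "s3"   # high volume, queried via Athena for about a week
    return "apm"      # everything else stays queryable by default
```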
Third — reduce APM sampling to what you actually review
The right sampling rate is the rate at which you can still reconstruct a specific slow request on demand, not the rate at which you retain every request forever. For most production services, that’s 1–5% head-based sampling plus tail-based sampling that keeps 100% of errors and high-latency outliers. Adaptive sampling is mature in most major APM vendors and is almost never turned on by default.
Typical savings: 30–50% of the trace line item.
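Head plus tail sampling reduces to a single keep/drop decision per trace. A sketch with assumed thresholds (2% head rate, 1,000 ms latency cutoff, both illustrative):

```python
import random

HEAD_RATE = 0.02           # keep 2% of ordinary traces (assumed)
LATENCY_CUTOFF_MS = 1000   # tail-keep anything slower (assumed)

def keep_trace(is_error, duration_ms, rng=random.random):
    """Tail-based: always keep errors and slow outliers.
    Head-based: keep a small random fraction of the rest."""
    if is_error or duration_ms > LATENCY_CUTOFF_MS:
        return True
    return rng() < HEAD_RATE
```

In a real deployment the tail decision has to run in a collector that has seen the whole trace before deciding; the policy itself is this shape.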
Fourth — set retention to actual investigation windows
Ask a responder how far back they have needed to look in the last year. The answer is almost always “seven days,” sometimes “thirty days,” rarely longer. Set retention to match, with a longer-term cold-storage tier for compliance evidence.
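The responder survey can be turned into a number directly. A sketch; the percentile choice is a judgment call, not a standard:

```python
def retention_from_lookbacks(lookback_days, percentile=0.95):
    """Retention window covering the given share of real incident
    lookbacks. `lookback_days` lists how far back each investigation
    in the last year actually had to reach."""
    ordered = sorted(lookback_days)
    idx = min(int(len(ordered) * percentile), len(ordered) - 1)
    return ordered[idx]

# Every investigation in the sample reached back a week or less:
print(retention_from_lookbacks([2, 3, 5, 7, 7, 7, 7, 7, 7, 7]))  # 7
```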
Pattern
Compliance retention is a separate problem from operational retention. Ship a compliance-grade copy to cheap storage (S3 with Object Lock, typically) and set the operational tier to the shorter window. You will pay twice, but the cheaper copy is a fraction of what you were paying for everything at the long retention.
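Paying twice sounds worse than it is, because the two tiers differ in price by roughly three orders of magnitude. A sketch with assumed rates; the $/GB-month figures are illustrative, not any vendor's rate card:

```python
def monthly_storage_cost(gb_per_day, hot_days, cold_days, hot_rate, cold_rate):
    """Steady-state cost of a hot operational tier plus a cold
    compliance tier, each billed on resident volume."""
    return (gb_per_day * hot_days * hot_rate
            + gb_per_day * cold_days * cold_rate)

# 50 GB/day: everything hot for a year vs. 7 days hot + a year cold.
everything_hot = monthly_storage_cost(50, 365, 0, hot_rate=2.50, cold_rate=0.004)
split = monthly_storage_cost(50, 7, 365, hot_rate=2.50, cold_rate=0.004)
print(round(everything_hot), round(split))  # 45625 948
```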
What not to cut
Some categories are tempting targets and almost always wrong to touch:
- Error rates and latency SLOs. These are the metrics that page your on-call. Low cardinality, high signal, do not touch.
- Audit logs. Required by compliance, required by incident response. Move to cheaper storage if necessary but do not reduce retention below regulatory minimums.
- Security-relevant logs. CloudTrail, authentication events, authorization denials. These are the logs that matter when it matters. Separate budget, separate decision.
The cuts that save money without losing signal are almost entirely in the “we turned this on once and never looked at it” category. The cuts that look tempting but damage the program are almost always in the “we rely on this daily but don’t remember why” category. Identifying which is which is the work.
What to change this month
- Run a cardinality report on your custom metrics. Cut the top 10.
- Audit log levels across all production services. Anything emitting DEBUG without a named owner gets raised to INFO.
- Separate access logs from error logs. Route access to S3; keep errors in the APM vendor.
- Enable tail-based sampling. Drop head-based sampling to 1–5%.
- Review retention settings against what responders actually query. Cut accordingly.
A well-run observability program costs meaningfully less than the infrastructure it observes. When it costs more, the cause is almost always operational drift, not technical necessity — and the fix is a one-month project, not a year.