Monitoring & Observability

Coverage gaps, alert quality, log strategy, SLO tracking, mean-time-to-detect (MTTD) and mean-time-to-recover (MTTR) analysis. The instrumentation that tells you there is a problem, and the signal-to-noise ratio that determines whether anyone acts on it.

Audit category · 08 of 08

01

Scope

What this audit covers

Observability is the connective tissue of the CloudCheck 360°: it is how you know that security controls are holding, that cost optimizations are not breaking things, and that the DR plan has not drifted. A healthy observability posture reduces the time-to-detect and the time-to-recover for every other category's incidents.

Metrics

Infrastructure and application metrics, cardinality hygiene, retention cost, custom-metric discipline, RED (rate, errors, duration) and USE (utilization, saturation, errors) method coverage.
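
To make RED coverage concrete, a minimal sketch of rate / errors / duration instrumentation, here using Python's prometheus_client; the metric names, label sets, and handler wrapper are illustrative assumptions, not prescriptions.

```python
# A minimal RED sketch with prometheus_client; names and labels are illustrative.
from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "http_requests_total", "Rate and Errors: requests served, by status",
    ["endpoint", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Duration: request latency",
    ["endpoint"],
)

def handle(endpoint, do_work):
    """Wrap a handler so rate, errors, and duration are all recorded."""
    with LATENCY.labels(endpoint).time():
        try:
            result = do_work()
            REQUESTS.labels(endpoint, "ok").inc()
            return result
        except Exception:
            REQUESTS.labels(endpoint, "error").inc()
            raise
```

Note the deliberately small label set (endpoint and status only); that restraint is exactly the cardinality hygiene this audit checks for.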

Logging

Structured vs unstructured, retention per log class, PII / secrets in logs, audit-log separation, cost per GB ingested, cold-tier strategy.
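
A minimal sketch of structured logging with an explicit log class, using only the Python standard library; the "log_class" field and the audit / app / debug taxonomy are assumptions for illustration. Tagging the class at emit time is what lets the ingestion pipeline apply per-class retention downstream.

```python
# A minimal structured-logging sketch using only the standard library.
# The "log_class" field and class names are illustrative assumptions.
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "msg": record.getMessage(),
            # Routing key: the ingestion pipeline maps this to a retention tier,
            # e.g. audit -> long and immutable, app -> short, debug -> shortest.
            "log_class": getattr(record, "log_class", "app"),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("svc")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("user login", extra={"log_class": "audit"})
```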

Tracing

Distributed-tracing coverage, sampling strategy, context propagation across service boundaries, trace-log correlation.
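
A minimal sketch of trace-log correlation with the OpenTelemetry Python API: stamp the active trace ID onto every log record so logs and traces can be joined. It assumes a tracer provider has already been configured by the SDK; the service and span names are illustrative.

```python
# A minimal trace-log correlation sketch with the OpenTelemetry Python API.
# Assumes a tracer provider is already configured; otherwise spans are no-ops.
import logging
from opentelemetry import trace

tracer = trace.get_tracer("checkout")

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        # 32-hex-char trace ID when a span is active, "-" otherwise.
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

logging.basicConfig(format="%(levelname)s trace_id=%(trace_id)s %(message)s")
log = logging.getLogger("checkout")
log.addFilter(TraceIdFilter())

with tracer.start_as_current_span("charge-card"):
    log.warning("gateway retry")  # carries the same trace_id as the span
```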

Alerting

Alert rule inventory, signal-to-noise ratio, alert fatigue metrics, pager routing, runbook linkage, severity discipline.

SLI / SLO / error budgets

Whether SLOs exist, whether they are user-centric, whether error budgets drive real decisions or just decorate dashboards.
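
As a concrete example of an error budget driving a decision rather than decorating a dashboard, a minimal sketch of the arithmetic; the SLO target, window, and policy thresholds are illustrative choices, not recommendations.

```python
# A minimal sketch of error-budget arithmetic for one request-based SLO.
# All numbers are illustrative.
slo_target = 0.999           # 99.9% of requests succeed over the window
window_requests = 4_200_000  # requests in the 28-day window
failed = 3_150               # failed requests so far

budget = window_requests * (1 - slo_target)  # 4,200 allowed failures
burned = failed / budget                     # 0.75 -> 75% of budget spent

if burned >= 1.0:
    action = "freeze risky releases; reliability work only"
elif burned >= 0.5:
    action = "slow rollouts; prioritize the top error sources"
else:
    action = "ship normally"
print(f"budget burned: {burned:.0%} -> {action}")
```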

Synthetics & RUM

Synthetic probe coverage for critical journeys, real-user-monitoring deployment, Core Web Vitals tracking, front-end error capture.
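
A minimal sketch of a synthetic probe for one critical journey, in Python with the requests library; the URL, timeout, and latency threshold are placeholders. Real deployments run probes like this on a schedule, from multiple regions.

```python
# A minimal synthetic-probe sketch: hit a critical journey's endpoint,
# record latency and status. URL and thresholds are placeholders.
import time
import requests

def probe(url="https://example.com/checkout", timeout_s=5, max_latency_s=1.0):
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=timeout_s)
        latency = time.monotonic() - start
        ok = resp.status_code == 200 and latency <= max_latency_s
    except requests.RequestException:
        latency, ok = time.monotonic() - start, False
    return {"ok": ok, "latency_s": round(latency, 3)}

if __name__ == "__main__":
    print(probe())
```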

02

Why it matters

Instrumentation is leverage

A well-instrumented system is cheaper to run, safer to change, and easier to hand to new engineers. A poorly instrumented system absorbs on-call effort, hides regressions until customers report them, and makes every post-incident review partly speculative.

Observability also often becomes one of the top three line items in a cloud bill, behind only compute and storage in many mid-to-large environments. The audit captures both sides: coverage gaps that cost you in incidents, and retention / cardinality excess that costs you in invoices.

03

Method

How we assess it

Three parallel angles: coverage, quality, and economics.

Angle A

Coverage map

For every critical service: which golden signals are instrumented, which are not, whether logs / metrics / traces are correlated. Gaps surface visually on a heatmap.
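
A minimal sketch of the data behind that heatmap, scored per service across the four golden signals; the services and values are illustrative.

```python
# A minimal coverage-map sketch: which golden signals each critical service
# actually instruments. Services and values are illustrative.
SIGNALS = ("latency", "traffic", "errors", "saturation")

coverage = {
    "checkout": {"latency": True, "traffic": True, "errors": True, "saturation": False},
    "search":   {"latency": True, "traffic": False, "errors": False, "saturation": False},
}

for service, signals in coverage.items():
    score = sum(signals[s] for s in SIGNALS) / len(SIGNALS)
    gaps = [s for s in SIGNALS if not signals[s]]
    print(f"{service}: {score:.0%} covered, gaps: {gaps or 'none'}")
```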

Angle B

Alert quality

Pull alert-firing history. Classify each alert as actionable, informational, or noise. Measure noise ratio and page distribution by team.
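
A minimal sketch of that classification pass; the record shape and labels are assumptions, and the classification itself comes from reviewing each fired alert with the owning team.

```python
# A minimal alert-quality sketch: classify fired alerts, compute the
# noise ratio per team. The input shape is an assumption.
from collections import Counter

# Each record: (alert_name, team, classification), where classification
# is "actionable", "informational", or "noise".
history = [
    ("HighCPU", "platform", "noise"),
    ("CheckoutLatencyP99", "payments", "actionable"),
    ("DiskFillForecast", "platform", "informational"),
    ("HighCPU", "platform", "noise"),
]

by_team = {}
for _, team, cls in history:
    by_team.setdefault(team, Counter())[cls] += 1

for team, counts in by_team.items():
    total = sum(counts.values())
    print(f"{team}: noise ratio {counts['noise'] / total:.0%} of {total} pages")
```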

Angle C

Observability economics

Ingestion and retention costs broken down by source, service, and class. Identify the 20 percent of sources driving 80 percent of cost and validate that spend against the value each source actually delivers.
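
A minimal sketch of the Pareto pass over ingestion spend; the sources and monthly figures are illustrative.

```python
# A minimal 80/20 sketch over ingestion spend by source.
# Figures are illustrative monthly costs.
costs = {"app-debug-logs": 41_000, "k8s-events": 12_500,
         "access-logs": 9_800, "custom-metrics": 7_200, "traces": 3_100}

total = sum(costs.values())
running = 0.0
for source, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
    running += cost
    print(f"{source}: {cost / total:.0%} (cumulative {running / total:.0%})")
    if running / total >= 0.8:
        break  # these sources are where the value review starts
```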

Tool-agnostic: Datadog, New Relic, Grafana Cloud, Honeycomb, CloudWatch / Azure Monitor / Google Cloud Operations, OpenTelemetry-native. We work in what you have.

04

Deliverables

What you get

  • Coverage heatmap — per service, per signal type, scored and color-coded. Instant visibility into blind spots.
  • Alert-quality scorecard — noise ratio, actionability rate, per-team page distribution, top-ten noisiest alerts with rewrite suggestions.
  • SLO recommendation set — three to five SLOs per critical user journey, with SLIs, thresholds, and error-budget policy.
  • Log-strategy plan — retention tiers per log class, cold-storage path, PII redaction gaps, audit-log separation proposal.
  • Observability cost model — current spend decomposed by source, projected cost after recommended changes, break-even on any proposed consolidation.
  • MTTD / MTTR baseline — measured, not estimated. The number you can improve against; see the sketch after this list.
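
A minimal sketch of that baseline measurement, computed from incident timestamps rather than recollection; the incident records are illustrative.

```python
# A minimal MTTD / MTTR sketch measured from incident records.
# Timestamps are illustrative.
from datetime import datetime as dt
from statistics import mean

incidents = [
    # (impact_start, detected, resolved)
    (dt(2024, 5, 2, 9, 0), dt(2024, 5, 2, 9, 25), dt(2024, 5, 2, 11, 10)),
    (dt(2024, 5, 9, 22, 14), dt(2024, 5, 9, 22, 18), dt(2024, 5, 9, 23, 2)),
]

mttd = mean((d - s).total_seconds() / 60 for s, d, _ in incidents)
mttr = mean((r - s).total_seconds() / 60 for s, _, r in incidents)
print(f"MTTD {mttd:.0f} min, MTTR {mttr:.0f} min over {len(incidents)} incidents")
```
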
05

Patterns

Common findings

Alerts on infrastructure metrics, not user-visible symptoms.

CPU, memory, disk alarms that fire when nothing is wrong for the user; no alarm when latency spikes for actual traffic. Rebuild alerts around RED metrics on user-facing endpoints.

Retention set to "whatever the default is," applied to all logs equally.

Verbose debug logs kept for 90 days, audit logs kept for 30. Tiering retention per log class (audit long and immutable, application short, debug shortest) often cuts log storage and retention cost by a third while improving compliance posture.
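
For illustration, a minimal sketch of per-class retention expressed as config; the classes, durations, and tiers are assumptions to adapt, not a prescription for any particular stack.

```python
# A minimal retention-tiering sketch. Class names, durations, and tiers
# are illustrative assumptions.
RETENTION_POLICY = {
    "audit": {"hot_days": 30, "cold_days": 2555, "immutable": True},  # ~7 years
    "app":   {"hot_days": 14, "cold_days": 90,   "immutable": False},
    "debug": {"hot_days": 3,  "cold_days": 0,    "immutable": False},
}

def retention_for(log_class: str) -> dict:
    # Unknown classes fall back to the shortest, cheapest tier.
    return RETENTION_POLICY.get(log_class, RETENTION_POLICY["debug"])
```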

No SLOs, or SLOs nobody references.

Dashboards with uptime percentages that nobody uses to make decisions. Without error-budget policy, SLOs are decoration. Fix: start with three user-journey SLOs and an explicit policy for what happens when the budget is exhausted.

Trace coverage partial; correlation broken.

Traces propagate across some service hops and drop at others. Logs and traces share no correlation ID. Debugging requires stitching context together manually.

Alert fatigue measurable and ignored.

Ten high-priority pages a night, three of them actionable. On-call is exhausted; real alerts get missed. A strict severity policy, aggressive auto-resolve, and runbook-linked alerts move the needle faster than more tooling.

Secrets in logs.

Authorization headers, API keys, or PII making it into application logs. Enforce redaction at log-ingestion time through pipeline redaction policies rather than relying on developer vigilance.
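
A minimal sketch of what ingestion-time redaction can look like; the patterns are illustrative and deliberately crude, and a production policy needs per-format rules and a test suite.

```python
# A minimal redaction sketch applied at ingestion, not in application code.
# Patterns are illustrative, not a complete policy.
import re

REDACTIONS = [
    (re.compile(r"(?i)(authorization:\s*Bearer\s+)\S+"), r"\1[REDACTED]"),
    (re.compile(r"(?i)(api[_-]?key[=:]\s*)\S+"), r"\1[REDACTED]"),
    (re.compile(r"\b\d{13,16}\b"), "[REDACTED-PAN]"),  # crude card-number catch
]

def redact(line: str) -> str:
    for pattern, repl in REDACTIONS:
        line = pattern.sub(repl, line)
    return line

print(redact("authorization: Bearer eyJhbGciOi api_key=sk_live_abc123"))
```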

06

FAQ

Questions we get asked

Should we consolidate on one observability platform?

Not necessarily. Best-of-breed stacks (Grafana for metrics, Honeycomb for traces, a purpose-built log platform) often beat single-vendor lock-in on both cost and capability. We make a recommendation based on your signal profile, team size, and existing commitments.

Will this reduce our Datadog / New Relic bill?

Almost always, yes, though reducing the bill is not the primary goal. The audit tends to find 20 to 40 percent of ingestion spend going to sources nobody references. Retention tiering, cardinality hygiene, and unused-metric cleanup compound from there.

Can we define SLOs if we do not have a lot of traffic?

Yes. Low-traffic services use time-based SLIs (percentage of minutes with at least one successful request) rather than request-based SLIs. The SLO framework still applies.
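
A worked example of that arithmetic, with illustrative numbers:

```python
# A minimal time-based SLI sketch for a low-traffic service: the SLI is the
# share of one-minute windows containing at least one successful request.
minutes_in_window = 28 * 24 * 60           # 28-day window
bad_minutes = 35                           # minutes with zero successful requests
sli = 1 - bad_minutes / minutes_in_window  # ~0.99913

slo_target = 0.999
status = "meeting" if sli >= slo_target else "breaching"
print(f"SLI {sli:.5f} vs SLO {slo_target} -> {status}")
```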

Is OpenTelemetry worth adopting now?

For new services, yes — OTel is the de facto standard and avoids vendor lock-in on the collection side. For existing services, migration is gradual and usually not the highest-ROI change unless you are also swapping backends.

Who owns this after you leave?

SLOs, alert policy, and retention tiers are delivered as config-as-code (Terraform / Datadog Monitor JSON / CloudWatch alarm definitions). Your platform team owns and iterates them after handover.

Start with a free Cloud Health Check.

A scoped-down CloudCheck 360° of your current environment. Delivered in five business days, no commitment.