KnowMBA Advisory · Digital Transformation · Intermediate · 8 min read

Observability Strategy

Observability is the practice of instrumenting systems so that their internal state can be inferred from external outputs. The three classical signals are metrics (numerical time series), logs (timestamped events), and traces (distributed request flow). The major commercial platforms are Datadog (broadest coverage, most expensive), New Relic, Splunk, Dynatrace (APM-centric), Honeycomb (event-based, BubbleUp methodology), and Grafana Cloud (open-source-aligned). The open-source stack centers on Prometheus + Grafana + Loki + Tempo + OpenTelemetry. OpenTelemetry (the second-largest CNCF project after Kubernetes) has become the standard instrumentation framework, letting organizations decouple instrumentation from the backend.

The KnowMBA POV: observability without ownership is just storage cost. Most enterprises buy Datadog or Splunk, ingest everything, build a few dashboards, and then discover that nobody actually USES the platform during incidents: engineers grep logs in their terminals because they can't navigate the platform fast enough. The missing discipline isn't tooling; it's service ownership, on-call rigor, SLO definition, and the habit of actually using observability data to drive decisions.

Also known as: Observability Platform Strategy, O11y Strategy, Telemetry Strategy, Logs/Metrics/Traces Strategy, OpenTelemetry Adoption

The Trap

The trap is treating observability as a tooling decision rather than a practice decision. Three failure modes dominate. (1) Cost explosion. Datadog, Splunk, and similar platforms charge by ingest volume. Without retention discipline, sampling, and selective instrumentation, observability bills grow 40-80% year over year and eventually rival the infrastructure they monitor; Datadog has multiple seven-figure customer stories and has faced public criticism over surprise bills (most famously the widely discussed ~$65M bill attributed in industry chatter to Coinbase, never officially confirmed). (2) Dashboard graveyard. Teams build hundreds of dashboards, ownership is unclear, half go stale, and during incidents nobody knows which ones to look at. (3) No actual SLOs. Teams have metrics but no formal service-level objectives, no error budgets, and no decision framework for when to ship features versus invest in reliability. The deeper trap is confusing data quantity with insight: ingesting 10x more telemetry doesn't make systems 10x more observable; it makes them more expensive and harder to navigate.
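The cost-explosion failure mode compounds quietly. A minimal sketch, with entirely hypothetical dollar figures and growth rates, of how an observability bill growing 60%/yr overtakes an infrastructure bill growing 15%/yr:

```python
# Hypothetical projection: observability spend compounding faster than the
# infrastructure it monitors. All dollar figures and rates are illustrative.

def project(spend: float, growth_rate: float, years: int) -> list[float]:
    """Return year-by-year spend under constant annual growth."""
    out = [spend]
    for _ in range(years):
        spend *= 1 + growth_rate
        out.append(spend)
    return out

infra = project(10_000_000, 0.15, 4)  # infra bill growing 15%/yr
o11y = project(1_000_000, 0.60, 4)    # observability bill growing 60%/yr

for year, (i, o) in enumerate(zip(infra, o11y)):
    print(f"year {year}: infra ${i/1e6:.1f}M, o11y ${o/1e6:.1f}M "
          f"({o/i:.0%} of infra)")
```

Under these assumed rates, observability starts at 10% of infrastructure cost and crosses 30% within four years, which is the point the Pro Tips below would call a crisis.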

What to Do

Six moves. (1) Standardize on OpenTelemetry for instrumentation. This decouples your code from your backend choice and avoids vendor lock-in; whether you ship to Datadog, Honeycomb, or Grafana, OTel is the unification layer. (2) Define SLOs per service before adding more dashboards. Without SLOs, observability is decoration; the Google SRE Workbook chapters on SLOs are the canonical reference. (3) Set per-team observability budgets (cost AND signal volume) and make teams accountable for what they ingest. Datadog's per-team usage attribution exists for this reason. (4) Implement aggressive sampling and retention policies: full fidelity for recent data (1-7 days), heavy sampling for older data, and archival to cold storage for compliance. Most observability platforms support tail-based sampling for traces. (5) Build a small set of high-quality, runbook-linked dashboards rather than hundreds of one-offs. Dashboard quality > quantity. (6) Hold blameless postmortems that explicitly evaluate observability: for every incident, ask 'what would have made this faster to detect or resolve?' and feed those learnings back into instrumentation.
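Move (4) is the one teams most often leave implicit. A minimal sketch of a tiered retention/sampling policy; the tier boundaries and sample rates here are hypothetical, not recommendations:

```python
# Hypothetical retention/sampling policy: full fidelity for recent data,
# heavy sampling for older data, cold archive for compliance, then deletion.
# Tier boundaries and sample rates are illustrative.

TIERS = [
    # (max_age_days, storage tier,       sample_rate)
    (7,    "hot (full fidelity)",        1.00),
    (30,   "warm (sampled)",             0.10),
    (365,  "cold archive (object store)", 0.01),
]

def policy_for(age_days: int) -> tuple[str, float]:
    """Return (storage tier, sample rate) for telemetry of a given age."""
    for max_age, tier, rate in TIERS:
        if age_days <= max_age:
            return tier, rate
    return "delete", 0.0

print(policy_for(3))    # recent data stays at full fidelity
print(policy_for(400))  # past the compliance window, drop it
```

Writing the policy down as data, as above, is what makes it reviewable in a cost discussion instead of living as scattered per-pipeline settings.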

Formula

Observability ROI ≈ (MTTR Reduction × Incident Frequency × Cost per Incident) − (Platform Cost + Instrumentation Engineering Time + Cognitive Overhead from Dashboard Sprawl)
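The formula restated with hypothetical annual numbers, reading "cost per incident" as cost per incident-hour so it composes with an MTTR reduction measured in hours:

```python
# Worked example of the observability ROI formula. Every number below is
# hypothetical; plug in your own incident and spend data.

def observability_roi(mttr_reduction_hrs, incidents_per_year,
                      cost_per_incident_hr, platform_cost,
                      engineering_time_cost, sprawl_overhead_cost):
    benefit = mttr_reduction_hrs * incidents_per_year * cost_per_incident_hr
    cost = platform_cost + engineering_time_cost + sprawl_overhead_cost
    return benefit - cost

roi = observability_roi(
    mttr_reduction_hrs=2,         # each incident resolved 2 hours faster
    incidents_per_year=120,
    cost_per_incident_hr=25_000,  # revenue/productivity loss per outage hour
    platform_cost=1_200_000,
    engineering_time_cost=400_000,
    sprawl_overhead_cost=200_000,
)
print(f"annual ROI: ${roi:,.0f}")  # → annual ROI: $4,200,000
```

Note that the sprawl-overhead term is the one organizations usually set to zero and shouldn't: every stale dashboard has a carrying cost in incident navigation time.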

In Practice

Datadog has become the dominant enterprise observability platform: public market cap of roughly $45B as of 2024, roughly $2.5B in annual revenue, and deployments at thousands of enterprises. That growth and customer concentration produced industry-wide cost concerns: Datadog bills exceeding $10M/year are common at large engineering organizations, and one widely discussed (though never officially confirmed) story circulated about Coinbase incurring a multimillion-dollar Datadog bill from a single misconfigured ingestion pipeline. Honeycomb (Charity Majors, Christine Yen) built a counter-narrative around event-based observability and the limitations of metrics-first approaches, becoming influential in the observability discourse without the ingest-everything cost model. Grafana Labs built the largest open-source-aligned observability platform (Prometheus + Grafana + Loki + Tempo + Mimir), reaching unicorn valuation by 2022. The market has converged on a clearer view: observability tooling is excellent but operationally and financially heavy, and the discipline (SLOs, ownership, sampling) matters more than the choice of vendor.

Pro Tips

  • 01

Datadog cost grows faster than infrastructure. Track 'observability cost as % of total infrastructure cost' as a KPI. Healthy: 5-15%. Concerning: 20-30%. Crisis: 30%+. Many organizations discover their Datadog bill is approaching parity with their AWS bill, at which point either the observability budget is wrong, the instrumentation is too verbose, or the platform choice is wrong for the workload.
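This KPI is a one-liner to compute and worth wiring into a monthly finance review. A sketch using the tiers from this tip (the gap between 15% and 20% is folded into "concerning" here; the input figures are hypothetical):

```python
# Classify observability spend as a share of infrastructure spend, using
# the tiers from the tip above. Thresholds simplified to two cut points.

def o11y_cost_health(o11y_monthly: float, infra_monthly: float) -> str:
    ratio = o11y_monthly / infra_monthly
    if ratio < 0.15:
        return f"healthy ({ratio:.0%})"
    if ratio < 0.30:
        return f"concerning ({ratio:.0%})"
    return f"crisis ({ratio:.0%})"

# Hypothetical monthly bills: $290K observability against $1.1M infra.
print(o11y_cost_health(290_000, 1_100_000))  # → concerning (26%)
```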

  • 02

OpenTelemetry adoption is the highest-leverage observability decision you can make. OTel decouples instrumentation from backend, meaning you can switch from Datadog to Grafana Cloud or Honeycomb, or run a hybrid, without re-instrumenting your code. The investment in OTel pays for itself the first time you renegotiate with your observability vendor.

  • 03

    SLOs change observability from monitoring to decision-making. A team without SLOs has a wall of dashboards and no opinion about what's good or bad. A team with SLOs has a clear boundary: when error budget is healthy, ship features; when it's burning, invest in reliability. The Google SRE Workbook's chapters on SLOs and error budgets are foundational reading.
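The ship/invest boundary is simple arithmetic once an SLO exists. A minimal sketch over a fixed window; the SLO target, request counts, and 25% decision threshold are all hypothetical:

```python
# Error-budget arithmetic behind the ship-vs-reliability boundary.
# Assumes a fixed (e.g. 30-day) window; all inputs are hypothetical.

def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the window's error budget still unspent (can go negative)."""
    budget = (1 - slo_target) * total_requests  # failures the SLO allows
    return (budget - failed_requests) / budget

# 99.9% SLO over 10M requests allows 10,000 failures; 4,000 have occurred.
remaining = error_budget_remaining(0.999, 10_000_000, 4_000)
decision = "ship features" if remaining > 0.25 else "invest in reliability"
print(f"{remaining:.0%} of error budget left -> {decision}")
```

This is the "opinion about what's good or bad" the tip describes: the same telemetry that was decoration on a dashboard becomes an input to a roadmap decision.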

Myth vs Reality

Myth

“More telemetry equals better observability”

Reality

Ingest volume is not the same as insight. Many organizations ingest terabytes of logs that nobody reads, generate millions of metric series that no dashboard queries, and produce traces at full fidelity that get sampled at query time anyway. The discipline is selective instrumentation aligned to specific failure modes and decisions, not maximum coverage.
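Selective instrumentation in miniature: a tail-sampling rule that keeps every error or slow trace at full fidelity and only a small fraction of the healthy rest. The latency threshold and baseline rate are illustrative:

```python
# Tail-based sampling sketch: keep the interesting traces (errors, outliers)
# at full fidelity, sample the routine majority. Thresholds are illustrative.
import random

def keep_trace(has_error: bool, duration_ms: float,
               baseline_rate: float = 0.05,
               slow_threshold_ms: float = 1_000) -> bool:
    if has_error or duration_ms >= slow_threshold_ms:
        return True  # always keep errors and slow outliers
    return random.random() < baseline_rate  # sample the healthy remainder

print(keep_trace(has_error=True, duration_ms=42))      # → True
print(keep_trace(has_error=False, duration_ms=2_500))  # → True
```

A rule like this can cut trace ingest by an order of magnitude while preserving exactly the traces an incident responder would query for, which is the point: coverage of failure modes, not coverage of volume.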

Myth

“Buying Datadog (or any platform) solves observability”

Reality

Datadog (like Splunk, New Relic, and Dynatrace) is excellent platform tooling, but tooling doesn't address the practice gap. SLO definition, on-call rigor, blameless postmortems, runbook discipline, dashboard ownership, and sampling policies are the work, and no platform purchase replaces them. The platform makes the practice scalable; the practice has to exist first.

Try it

Run the numbers.

Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.

🧪

Scenario Challenge

Your CFO sends you a panicked email: the Datadog bill went from $80K/month last year to $290K/month this quarter, with no commensurate growth in infrastructure. Engineering leadership defends the spend: 'we need this visibility for production reliability.' What's the right diagnostic and response?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets, not absolutes.

Observability Spend as % of Infrastructure Cost

Cloud-native production environments

Lean: < 8%
Healthy: 8-15%
High: 15-25%
Out of Proportion: > 25%

Source: Hypothetical: composite from FinOps Foundation observations and platform vendor case studies

Mean Time to Recovery (MTTR): Mature Practice

DORA (Accelerate State of DevOps) MTTR tiers

Elite (DORA top performers): < 1 hour
High: 1-24 hours
Medium: 1 day - 1 week
Low: > 1 week

Source: DORA Accelerate State of DevOps Report (annual)

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.

๐Ÿถ

Datadog

2010-Present

mixed

Datadog launched in 2010 and grew into the dominant enterprise observability platform: public market cap of roughly $45B as of 2024 and annual revenue of roughly $2.5B. Datadog's growth came from a comprehensive product (APM + infrastructure monitoring + log management + RUM + security) and an aggressive enterprise sales motion. The flip side is customer cost concern. Industry-wide reports of Datadog bills exceeding $10M annually became common at larger engineering organizations, and the platform faced criticism over pricing-model complexity (per-host, per-custom-metric, per-log-GB, per-trace-event, per-RUM-session, each with different rates and unit economics). One widely discussed (though never officially confirmed) story circulated in 2022-2023 about a major crypto company incurring a multimillion-dollar Datadog bill from misconfigured ingestion. The industry-wide pattern: Datadog delivers excellent capability and excellent margins simultaneously, and customer cost discipline must be operated as actively as feature adoption.

Founded: 2010 (NYC)
Market Cap (2024): ~$45B
Annual Revenue: ~$2.5B
Common Enterprise Bill: $1M-$10M+/year
Pricing Complexity: Multiple per-unit dimensions

Excellent observability tooling can produce excellent vendor margins through cost-model complexity. Organizations that deploy Datadog without per-team cost attribution, sampling discipline, and ingest budgeting routinely discover their bills grew faster than infrastructure. The platform's value is real; the cost discipline is mandatory.

๐Ÿฏ

Honeycomb

2016-Present

success

Honeycomb (founded by Charity Majors and Christine Yen, both ex-Facebook) built an observability platform around 'observability 2.0' principles: high-cardinality, event-based data rather than pre-aggregated metrics; the BubbleUp methodology for outlier detection; and a focus on understanding distributed-systems behavior rather than monitoring known unknowns. Honeycomb's commercial scale is much smaller than Datadog's, but its intellectual influence has been disproportionate: the book 'Observability Engineering' (O'Reilly), written by Honeycomb leaders, shaped how a generation of senior engineers think about telemetry. Charity Majors's writing on production excellence, ownership, and operational practice is canonical in the SRE community.

Founded: 2016 (San Francisco)
Founders: Charity Majors, Christine Yen (both ex-Facebook)
Commercial Scale: Sub-Datadog, but profitable
Intellectual Influence: 'Observability Engineering' (O'Reilly)

Observability is as much a thinking practice as a tooling category. Honeycomb's contribution was less commercial than intellectual: reframing observability around exploration of unknown failure modes rather than monitoring of known metrics. The mature engineering organization adopts the thinking even when the tool is something else.



Beyond the concept

Turn Observability Strategy into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.

Typical response time: 24h · No retainer required
