Data Strategy · Intermediate · 8 min read

Data Quality Monitoring

Data Quality Monitoring is the continuous, automated detection of anomalies in data: freshness lapses, volume spikes or drops, schema changes, distribution shifts, and broken referential integrity. Tools like Monte Carlo, Anomalo, Bigeye, and Soda apply ML to baseline 'normal' for each dataset and alert when something deviates. The discipline differs from manual data quality testing in two ways: (1) it covers data you didn't think to test (anomaly detection finds the unknown unknowns), and (2) it runs continuously, not just in CI/CD. KnowMBA POV: most companies invest heavily in pipeline reliability monitoring (did the job run?) and almost nothing in DATA reliability monitoring (was the data the job produced correct?). The latter causes far more silent business damage.

Also known as: Data Observability Monitoring, Data Reliability Monitoring, DQ Monitoring
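
To make the mechanics concrete, below is a minimal Python sketch of the baseline-and-deviate logic this kind of tool applies, assuming you can already query a table's daily row counts and its latest load timestamp. The table history, thresholds, and function names are illustrative assumptions, not any vendor's actual implementation.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev

def volume_anomaly(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag a volume anomaly when today's row count deviates more than
    z_threshold standard deviations from the trailing baseline."""
    baseline_std = stdev(history) or 1.0  # guard against a perfectly flat history
    return abs(today - mean(history)) / baseline_std > z_threshold

def freshness_lapse(last_loaded_at: datetime, max_lag: timedelta = timedelta(hours=2)) -> bool:
    """Flag a freshness lapse when the table has not loaded within max_lag."""
    return datetime.now(timezone.utc) - last_loaded_at > max_lag

# Illustrative 14-day row-count history for a hypothetical orders table.
history = [10_120, 9_980, 10_310, 10_050, 9_870, 10_220, 10_140,
           10_090, 9_940, 10_280, 10_160, 10_010, 9_990, 10_200]
print(volume_anomaly(history, today=4_300))                              # True: sharp volume drop
print(freshness_lapse(datetime.now(timezone.utc) - timedelta(hours=6)))  # True: table is stale
```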

The Trap

The trap is treating data quality monitoring as a tooling purchase. Buy Monte Carlo, deploy on 800 tables, get 200 alerts/day, mute everything within 3 weeks. The hard work is curating: which 50 tables actually matter, what defines 'normal' for each, who gets paged, what is the runbook. Without that discipline, the platform creates alert fatigue and the underlying problem (silent data corruption) persists.

What to Do

Roll out monitoring in three waves: (1) Tier-1 only: pick the 20-50 tables that feed executive dashboards, billing, ML models, or public-facing products. Define explicit checks (uniqueness, not-null, freshness, referential integrity) for each. (2) Anomaly detection on tier-1: turn on ML-based volume/distribution alerts only on the curated set. (3) Tier-2 broader rollout: only after tier-1 alerts are tuned and acted on consistently. Skipping straight to 'monitor everything' is how you get alert fatigue.
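
A minimal sketch of what wave-1 explicit checks might look like, written as SQL that returns zero rows when the data is healthy. The orders/customers tables, column names, freshness window, and the run_query helper are hypothetical stand-ins for your own warehouse, dialect, and client.

```python
# Hypothetical wave-1 check definitions for one tier-1 table. Each query should
# return zero rows when the data is healthy; run_query(sql) is a stand-in that
# executes SQL against the warehouse and returns the result rows.
TIER1_CHECKS = {
    "orders.uniqueness": """
        SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1
    """,
    "orders.not_null": """
        SELECT order_id FROM orders WHERE customer_id IS NULL OR amount IS NULL
    """,
    "orders.freshness": """
        SELECT 1 WHERE (SELECT MAX(loaded_at) FROM orders)
                       < CURRENT_TIMESTAMP - INTERVAL '2 hours'
    """,
    "orders.referential_integrity": """
        SELECT o.order_id FROM orders o
        LEFT JOIN customers c ON o.customer_id = c.customer_id
        WHERE c.customer_id IS NULL
    """,
}

def run_tier1_checks(run_query) -> list[str]:
    """Return the names of checks that failed (i.e. returned at least one row)."""
    return [name for name, sql in TIER1_CHECKS.items() if run_query(sql)]
```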

Formula

Data Incident MTTD = Avg Time from Data Issue Occurring → Detected by Monitoring System
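
A worked example of the formula, assuming a simple incident log that records when each data issue actually began and when monitoring (or a stakeholder) detected it; the timestamps below are made up for illustration.

```python
from datetime import datetime

# Hypothetical incident log for one month.
incidents = [
    {"occurred_at": datetime(2024, 3, 1, 2, 0),  "detected_at": datetime(2024, 3, 1, 9, 30)},
    {"occurred_at": datetime(2024, 3, 8, 0, 0),  "detected_at": datetime(2024, 3, 10, 14, 0)},
    {"occurred_at": datetime(2024, 3, 19, 4, 0), "detected_at": datetime(2024, 3, 19, 6, 0)},
]

lags_hours = [(i["detected_at"] - i["occurred_at"]).total_seconds() / 3600 for i in incidents]
mttd_hours = sum(lags_hours) / len(lags_hours)
print(f"Data Incident MTTD: {mttd_hours:.1f} hours")  # 23.8 hours across these three incidents
```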

In Practice

Monte Carlo Data, founded in 2019, pioneered the 'data observability' category by applying SRE-style monitoring concepts to data: freshness, volume, schema, distribution, and lineage as the 'five pillars.' Customers including Fox, Vimeo, and CreditKarma deployed Monte Carlo to detect data incidents before downstream consumers noticed. By 2024, Monte Carlo was joined by Anomalo (ML-first), Bigeye (open standards), and Soda (developer-friendly OSS); the category became standard for data-mature enterprises, but adoption maturity varies wildly even among companies that bought a tool.

Pro Tips

01. The single highest-impact monitor is FRESHNESS on tier-1 tables. 'The dashboard is showing yesterday's number as today's' is by far the most common silent failure, and it's also the easiest to detect.

02. Severity-tier your alerts. P0 = customer-facing data wrong (page on-call). P1 = exec dashboard wrong (ticket within 1 business hour). P2 = analyst-impacting (ticket within 1 business day). Without severity tiers, every alert is treated the same and the team burns out (see the routing sketch after this list).

03. Track three metrics monthly: MTTD (Mean Time to Detect), MTTR (Mean Time to Resolve), and 'Incidents Caught Before Stakeholder Reported.' The third is the leading indicator that monitoring is actually working.
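
A minimal sketch of the severity-tier routing described in tip 02; the severity definitions mirror the tip, while the routing table and action names are assumptions standing in for your own paging and ticketing integrations.

```python
from enum import Enum

class Severity(Enum):
    P0 = "customer-facing data wrong"
    P1 = "exec dashboard wrong"
    P2 = "analyst-impacting"

# Hypothetical routing table: action and response SLA per severity tier.
ROUTING = {
    Severity.P0: ("page on-call", "respond immediately"),
    Severity.P1: ("open ticket", "respond within 1 business hour"),
    Severity.P2: ("open ticket", "respond within 1 business day"),
}

def route_alert(table: str, severity: Severity) -> str:
    """Describe how an alert on the given table should be routed."""
    action, sla = ROUTING[severity]
    return f"{table}: {severity.name} ({severity.value}) -> {action}, {sla}"

print(route_alert("billing.invoices", Severity.P0))
# billing.invoices: P0 (customer-facing data wrong) -> page on-call, respond immediately
```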

Myth vs Reality

Myth

"If our pipelines are reliable, our data is reliable"

Reality

Pipeline reliability (did the job complete?) and data reliability (was the output correct?) are different problems. A pipeline can run successfully and produce silently corrupt data: wrong joins, stale upstream source, schema drift the pipeline coerced through. You need separate monitoring for each.

Myth

"Data tests in dbt are sufficient"

Reality

dbt tests catch known failure modes you thought to write. They don't catch schema drift in upstream sources, distribution shifts in input data, or volume anomalies. Tests + observability is the right combo; either alone leaves blind spots.
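
One of those blind spots, upstream schema drift, can be illustrated with the kind of column-diff check an observability layer runs automatically. The snapshots below are hypothetical; in practice they would come from the warehouse's information schema, captured on each run.

```python
# Compare the columns (and types) a table exposes today against the last
# recorded snapshot, reporting drift that rule-based tests typically miss.
def diff_schema(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    return {
        "added":   [c for c in current if c not in previous],
        "removed": [c for c in previous if c not in current],
        "retyped": [c for c in current if c in previous and current[c] != previous[c]],
    }

yesterday = {"order_id": "bigint", "amount": "numeric", "currency": "varchar"}
today     = {"order_id": "bigint", "amount": "varchar", "region": "varchar"}

print(diff_schema(yesterday, today))
# {'added': ['region'], 'removed': ['currency'], 'retyped': ['amount']}
```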

Try it

Run the numbers.

Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.


Knowledge Check

A finance dashboard has been showing the wrong revenue number for 11 days. The pipeline ran successfully every day. The error was caught when the CFO noticed in board prep. What is the highest-leverage fix to prevent this class of incident?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets, not absolutes.

Data Incident MTTD (Tier-1 Datasets)

Business-critical data assets with downstream consumers (dashboards, ML, operational systems)

Elite: < 1 hour
Strong: 1-8 hours
Acceptable: 8 hours - 2 days
Poor: > 2 days (often stakeholder-reported)

Source: Hypothetical: Monte Carlo State of Data Quality 2024 + KnowMBA practitioner interviews

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.


Monte Carlo Data

2019-Present

success

Monte Carlo founded the 'data observability' category in 2019 with a platform built around the five pillars: freshness, volume, schema, distribution, and lineage. By 2024, Monte Carlo had hundreds of enterprise customers including Fox, Vimeo, CreditKarma, and PagerDuty, and the data observability category had grown to ~$500M annually with multiple competitors. Monte Carlo's customer reports consistently show 80%+ reduction in data incident MTTD after deployment.

Founded: 2019
Marquee Customers: Fox, Vimeo, CreditKarma, PagerDuty
Typical MTTD Reduction: 80%+
Category Size (2024): ~$500M annual

Treating data like a production system (with monitoring, on-call, and incident response) produces measurable reliability gains. The discipline is a force multiplier on every other data investment.


Anomalo

2018-Present

success

Anomalo took a different approach to data quality: ML-first anomaly detection on every column of every important table, with no manual rule writing. Customers including Block (Square), Discover, and Buzzfeed deployed Anomalo to catch quality issues that rule-based testing missed. The product validated that the 'unknown unknowns' problem in data quality is real and ML-tractable, and that tooling choice should match the team's preferred operating model (rule-heavy vs ML-driven).

Founded: 2018
Approach: ML-first, no manual rules
Marquee Customers: Block, Discover, Buzzfeed

Different DQ tools optimize for different operating models. ML-first (Anomalo) is best when you have massive table coverage and no time to write rules; rule-first (Soda, Bigeye) is best when you want explicit, auditable controls. Choose based on team preference, not vendor claims.


Decision scenario

Data Quality Monitoring Rollout

You're head of data at a 600-person company. The CFO just discovered a 9-day-old reporting error that misled board prep. CEO wants 'this to never happen again.' You're evaluating Monte Carlo, Anomalo, and Soda. Annual budget approved: $200K. Your team has 4 data engineers + 6 analytics engineers. You have ~1,200 tables in your warehouse, of which ~80 are 'tier-1.'

Total Tables: 1,200
Tier-1 Tables: 80
Current MTTD: ~7 days (stakeholder-reported)
Approved Budget: $200K/year
DE + AE Team Size: 10

Decision 1

Your VP of Engineering wants you to deploy on all 1,200 tables to maximize coverage. Your most senior data engineer says start with just the 80 tier-1 tables and tune carefully. The vendor sales engineer says they can deploy on all 1,200 in week one.

Option 1 (maximum coverage): deploy on all 1,200 tables in the first month for breadth.
Within 3 weeks, the team is receiving 150+ alerts per day. Most are noise (legitimate business variation flagged as anomalies). Real incidents get lost. By month 2, the team has muted entire categories of alerts and started ignoring the system. The CFO's revenue dashboard quietly breaks again at month 4 โ€” and the alert that should have caught it was muted along with the noise.
Daily Alert Volume: 0 → 150+ → muted
MTTD: ~7 days (unchanged)
Team Trust in Tooling: Destroyed
Option 2 (tier-1 first): deploy on 80 tables with explicit checks + tuned anomaly detection; expand only after 90 days of clean signal.
Month 1: 80 tables instrumented, ~15 alerts/day, team triages and tunes. Month 2: alert volume down to 5/day, all real. Month 3: MTTD drops to <8 hours on tier-1. Two real incidents caught and fixed before stakeholders noticed. CFO dashboard never breaks again. Month 4: expand to tier-2 (next 200 tables) with the tuning playbook learned. By end of year, 600 tables monitored with sustainable signal.
Tier-1 MTTD: 7 days → <8 hours
Daily Alert Volume: Sustainable (~5)
Stakeholder-Reported Incidents: ↓ 80%


Beyond the concept

Turn Data Quality Monitoring into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.

Typical response time: 24h · No retainer required
