Data Strategy · Intermediate · 7 min read

Data Pipeline Testing

Data pipeline testing is the discipline of validating that your pipelines produce correct, complete, and trustworthy data before consumers see it. Unlike software unit tests (which validate code), data tests validate the data itself: row counts, null rates, schema, referential integrity, business rules, anomaly detection. dbt tests, Great Expectations, and Soda Core are the dominant frameworks. The hard truth: most data pipelines run with between 0 and 5 tests in production, and most failures are detected by an angry executive spotting a wrong number on a dashboard. Engineering teams that ship code with 80% test coverage routinely ship data pipelines with 0% test coverage and act surprised when data quality is bad.

Also known as: Data Tests, Pipeline Tests, Data Quality Tests, ETL Testing

The Trap

The trap is testing only schema (column types and nullability) and calling it done. Schema tests catch about 20% of real-world data quality issues. The other 80% are logical: 'revenue is negative,' 'customer_id appears in 50% of rows when it should be 100%,' 'today's row count is 30% lower than yesterday's,' 'duplicate primary keys silently overwrote rows.' These require business-rule tests and freshness/volume anomaly tests. Teams that test only schema get a false sense of security and ship broken data confidently.

What to Do

Adopt a tiered testing strategy: (1) Schema tests on every model: column types, nullability, primary key uniqueness. (2) Business rule tests on critical models: revenue ≥ 0, valid status enums, foreign key integrity. (3) Freshness and volume tests: alert when a daily pipeline produces zero rows or 50% fewer rows than the trailing 7-day average. (4) Data contract tests at producer-consumer boundaries. Run all tests as part of the pipeline; fail loudly. Define a policy for what fails the pipeline (blocks downstream) versus what alerts (continues but pages someone).
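As an illustration, the tiers and the fail-vs-alert policy above can be sketched as a tiny test runner. This is a minimal sketch in plain Python, not any framework's API; all function names, column names, and thresholds are hypothetical:

```python
from statistics import median

# Hypothetical checks for a daily orders table.
# Tiers 1-2 (schema + business rules) BLOCK the pipeline;
# tier 3 (volume/freshness) ALERTS but lets the run continue.

def check_not_null(rows, column):
    return all(r.get(column) is not None for r in rows)

def check_unique(rows, column):
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def check_revenue_non_negative(rows):
    return all(r["revenue"] >= 0 for r in rows)

def check_volume(today_count, trailing_counts, low=0.5):
    # Alert when today's volume drops below 50% of the trailing median.
    return today_count >= low * median(trailing_counts)

def run_checks(rows, trailing_counts):
    blocking = {
        "customer_id not null": check_not_null(rows, "customer_id"),
        "order_id unique": check_unique(rows, "order_id"),
        "revenue >= 0": check_revenue_non_negative(rows),
    }
    alerting = {
        "volume vs 7-day median": check_volume(len(rows), trailing_counts),
    }
    failed_blocking = [name for name, ok in blocking.items() if not ok]
    if failed_blocking:
        # Fail loudly: stop downstream models from consuming bad data.
        raise RuntimeError(f"Pipeline blocked: {failed_blocking}")
    # Alerts: page someone, but let the pipeline continue.
    return [name for name, ok in alerting.items() if not ok]

rows = [
    {"order_id": 1, "customer_id": "a", "revenue": 100.0},
    {"order_id": 2, "customer_id": "b", "revenue": 35.5},
]
alerts = run_checks(rows, trailing_counts=[980, 1010, 1002, 995, 1023, 990, 1008])
print(alerts)  # volume alert fires: 2 rows against a ~1000-row median
```

The point of the structure is the policy split: schema and business-rule failures raise and block downstream consumers, while volume anomalies alert without halting the run.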

Formula

Test Coverage = (Models with ≥ 3 tests / Total Models) × 100; Test-Catch Rate = (Failures caught by tests / Total failures) × 100. Target: > 70% catch rate.
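Plugging illustrative numbers into both formulas (the counts here are made up for the example, not benchmarks from this article):

```python
# Test Coverage = (models with >= 3 tests / total models) * 100
models_with_3plus_tests = 120
total_models = 200
test_coverage = models_with_3plus_tests / total_models * 100

# Test-Catch Rate = (failures caught by tests / total failures) * 100
failures_caught_by_tests = 15
total_failures = 20
catch_rate = failures_caught_by_tests / total_failures * 100

print(f"Coverage: {test_coverage:.0f}%, catch rate: {catch_rate:.0f}%")
print("Meets 70% catch-rate target:", catch_rate > 70)
# Coverage: 60%, catch rate: 75% -> target met
```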

In Practice

dbt (data build tool) ships with built-in tests for unique, not_null, accepted_values, and relationships. By 2024 it was the de facto SQL transformation framework with hundreds of thousands of users. Great Expectations (founded 2018) extended this with a richer expectation library: column distributions, time-series anomalies, conditional expectations. The combination (dbt for transformation tests, Great Expectations for advanced data quality) is the canonical modern stack. Yet dbt's own community surveys consistently show median test coverage per project is 1-2 tests per model, far below what catches real failures.

Pro Tips

  1. Volume anomaly tests are the highest-ROI single test you can add. 'Today's row count is between 70% and 130% of the 7-day median' catches dropped sources, broken joins, and runaway duplicates. Add it to every fact table.

  2. Test at the contract boundary, not just the destination. If team A produces a table consumed by teams B, C, and D, the producer should run tests that fail their own pipeline when outputs violate the contract. Catching it downstream is too late.

  3. Calogica's dbt-expectations package ports Great Expectations-style checks into dbt syntax. For SQL-first teams, this gives you Great Expectations power without the Python infrastructure.
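The 70-130%-of-7-day-median rule from the first tip is small enough to sketch directly. This is an illustrative standalone function, not the syntax of any particular framework:

```python
from statistics import median

def volume_anomaly(today_count, daily_counts_7d, low=0.7, high=1.3):
    """Flag when today's row count falls outside 70-130% of the 7-day median."""
    base = median(daily_counts_7d)
    return not (low * base <= today_count <= high * base)

history = [10_000, 10_250, 9_900, 10_100, 10_050, 9_980, 10_120]
print(volume_anomaly(10_300, history))  # False: within the band
print(volume_anomaly(6_800, history))   # True: dropped source or broken join
print(volume_anomaly(14_000, history))  # True: likely runaway duplicates
```

Using the median rather than the mean keeps one bad day in the trailing window from skewing the baseline.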

Myth vs Reality

Myth

“Data tests are just like unit tests”

Reality

Unit tests validate deterministic code with deterministic inputs. Data tests validate data that changes constantly with real-world variation. A data test that fires once a year on a real edge case is valuable. A unit test with that signal-to-noise ratio is broken. Different discipline, similar tooling.

Myth

“Testing slows pipelines down too much”

Reality

Tests typically add 5-15% to pipeline runtime. Compared to the cost of one wrong board-deck number, that's trivial. Teams that 'can't afford to add tests' usually can't afford the alternative: they're already paying the cost in incidents, just not measuring it.


Knowledge Check

Your data team ships 200 dbt models and has roughly 150 tests total, almost entirely 'unique' and 'not_null' on primary keys. The CFO discovers a $2M revenue reporting error caused by a silently dropped customer source. What's the single highest-ROI test to add?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets, not absolutes.

Data Test Coverage (% of models with ≥ 3 tests)

Data teams using dbt or similar transformation frameworks

Elite: > 80%

Good: 50-80%

Average: 20-50%

Underinvested: < 20%
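The tiers above reduce to a trivial lookup; the cutoffs come from the table, while the function name and boundary handling are my own illustration:

```python
def coverage_tier(pct):
    """Map data-test coverage % (models with >= 3 tests) to a benchmark tier."""
    if pct > 80:
        return "Elite"
    if pct >= 50:
        return "Good"
    if pct >= 20:
        return "Average"
    return "Underinvested"

print(coverage_tier(85))  # Elite
print(coverage_tier(8))   # Underinvested
```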

Source: Hypothetical synthesis from dbt Community Surveys

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.


dbt Labs

2016-2026 · Outcome: mixed

dbt was created by Tristan Handy at RJMetrics (then Fishtown Analytics, now dbt Labs) to bring software engineering practices (tests, version control, modularity) to SQL-based data transformation. Built-in tests (unique, not_null, accepted_values, relationships) made data testing accessible in four lines of YAML. By 2024, dbt was the dominant transformation framework with hundreds of thousands of users. Yet community surveys consistently showed median test coverage per project was 1-2 tests per model, far below what's needed to catch most failure modes.

Founded

~2016

Built-in Tests

4 (unique, not_null, accepted_values, relationships)

Median Tests/Model

1-2 (far too low)

Making testing easy is necessary but not sufficient. Most teams stop at 1-2 tests per model. The discipline of writing real business-rule tests requires cultural investment, not just tooling.


Great Expectations

2018-2026 · Outcome: success

Great Expectations (founded 2018 by Abe Gong and James Campbell) extended data testing beyond schema with a rich expectation library: column distributions, time-series anomalies, multi-column expectations, custom Python checks. It became the de facto Python-native data quality framework. Often paired with dbt (for SQL transformations) and Airflow/Dagster (for orchestration). The trio (dbt + Great Expectations + Dagster) is a canonical modern data quality stack.

Founded

2018

Expectations Available

300+

Common Pairing

dbt + Dagster + Great Expectations

Different tests need different tools. SQL-native checks belong in dbt; statistical and distributional checks need a richer framework like Great Expectations. Use the right tool for the right test.


Decision scenario

Building a Data Testing Discipline

You're VP Data at a 300-person company. Last quarter: 18 production data incidents, 4 visible to executives, 1 caused a $400K revenue mis-report. Your team of 12 has been resistant to writing tests because 'it slows us down.' You have one quarter to change this.

Quarterly Incidents

18

Executive-Visible

4

dbt Models

180

Test Coverage

~8%

01

Decision 1

You need to choose a strategy that ships visible improvement in one quarter.

Option A: Mandate 5+ tests on every model, targeting 100% coverage by quarter end.
Outcome: Engineering pushes back hard. Velocity tanks. Six weeks in, you've got partial coverage on 60% of models, ~80% of which is low-value boilerplate ('not null' on already-not-null columns). Engineers are checked out. Incident count unchanged.
Coverage: 8% → 35% · Test Quality: Low · Incidents: Unchanged

Option B: Tier the models. Identify the 30 'critical' models (those feeding exec dashboards or finance), require 5 high-value tests on each (volume anomaly, business rules, integrity), and accept lower coverage on the rest.
Outcome: Feasible in 8 weeks. The 30 critical models account for ~80% of executive-visible incidents. Volume anomaly tests catch 3 incidents in their first month that would previously have surfaced as exec complaints. By quarter end, executive-visible incidents drop from 4 to 1. Engineering buys in because the work targets real pain, not coverage theater.
Critical Model Coverage: 8% → 100% · Executive-Visible Incidents: 4 → 1 · Team Buy-In: Strong

Related concepts

Keep connecting.

The concepts that orbit this one; each one sharpens the others.

Beyond the concept

Turn Data Pipeline Testing into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.

Typical response time: 24h · No retainer required
