KnowMBA Advisory · Data Strategy · Advanced · 7 min read

Data Contracts

A Data Contract is a formal, versioned agreement between a data PRODUCER (typically an upstream service, app, or operational system) and DOWNSTREAM CONSUMERS (analytics, ML, ops tools) about the shape, semantics, freshness, and reliability of the data. Concretely, it is a schema definition + semantic definitions (what each field actually means) + an SLA (freshness, completeness) + a versioning/deprecation policy + automated enforcement (CI checks, runtime validation). Without contracts, every schema change in a producing service silently breaks downstream pipelines, dashboards, and ML models. Data contracts shift the data-quality battle upstream: producers are explicitly accountable for the data their services emit, and breaking changes require a deprecation cycle.
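To make the components concrete, here is a minimal sketch of a contract expressed as code. Everything in it — the dataset name, field names, SLA numbers, and the validator — is illustrative; real programs typically use Protobuf, JSON Schema, or Avro plus a registry rather than a hand-rolled dict.

```python
# Illustrative data contract: schema + semantics + SLA in one artifact.
# All names and thresholds are hypothetical examples.
USER_EVENTS_CONTRACT = {
    "name": "user_events",
    "version": "1.0.0",
    "schema": {                  # field -> expected Python type
        "user_id": str,
        "event_type": str,
        "occurred_at": str,      # ISO-8601 timestamp
    },
    "semantics": {
        "user_id": "Authenticated account id, NOT a device id",
        "event_type": "One of: signup, login, purchase",
        "occurred_at": "Event time in UTC, not ingestion time",
    },
    "sla": {"max_staleness_minutes": 60, "min_completeness": 0.99},
}

def validate_record(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations for one record (empty = valid)."""
    errors = []
    for field, expected_type in contract["schema"].items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors
```

A conforming record returns an empty violation list; a record missing `occurred_at` returns `["missing field: occurred_at"]`, which is the kind of signal runtime validation feeds back to the producer.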

Also known as: Schema Contracts · Producer-Consumer Contracts · Data API Contracts · Data SLAs

The Trap

The trap is treating contracts as documentation. A markdown file describing the schema doesn't prevent anything — engineers ship schema changes anyway, the doc gets stale, and downstream pipelines still break. Real data contracts are enforced in CI: a schema change that breaks the contract fails the producer's deploy. The other trap is over-contracting — formalizing every field of every dataset. The cost is enormous and most fields don't need it. Contracts add value where there is a clear producer/consumer dependency that has caused incidents (typically <15% of datasets). Apply them surgically, not universally.
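The "enforced in CI" point can be sketched as a diff between the registered contract schema and the schema a producer is about to ship. This is a toy illustration (schemas as dicts of field name to type name), not any particular tool's API:

```python
# Sketch of a producer-side CI gate: diff the proposed schema against the
# contract's registered schema and fail the build on breaking changes.
def breaking_changes(registered: dict, proposed: dict) -> list[str]:
    """Removed fields and type changes break consumers; additions do not."""
    breaks = []
    for field, ftype in registered.items():
        if field not in proposed:
            breaks.append(f"removed: {field}")
        elif proposed[field] != ftype:
            breaks.append(f"type changed: {field} ({ftype} -> {proposed[field]})")
    return breaks

registered = {"user_id": "string", "user_status": "string"}
proposed   = {"user_id": "string", "account_status": "string"}  # a rename

violations = breaking_changes(registered, proposed)
if violations:
    print("CI FAIL:", violations)  # in real CI: exit non-zero, block deploy
```

Note that a rename shows up as a removal: the check does not need to understand intent, only that a field consumers depend on has disappeared. A markdown doc catches none of this; the build gate catches all of it.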

What to Do

Pick the top 5-10 high-value, high-incident datasets where producer schema changes regularly break downstream consumers. For each: (1) Define the schema explicitly (Protobuf, JSON Schema, Avro, or SQL DDL). (2) Define semantic meaning per field. (3) Define freshness/completeness SLA. (4) Add CI enforcement on the producer side (failing builds for breaking changes). (5) Establish versioning and deprecation policy (e.g., '90-day notice for breaking changes'). Measure incident reduction after 90 days. Expand contracts only where they pay off in incident reduction.
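Step (3), the freshness SLA, is the simplest to automate. A hedged sketch, where the 60-minute threshold is an assumed contract value rather than a standard:

```python
from datetime import datetime, timedelta, timezone

# Illustrative freshness-SLA check for step (3) above.
# The 60-minute threshold is an assumption, not a standard.
def freshness_ok(last_loaded_at: datetime, max_staleness: timedelta) -> bool:
    """True if the dataset's newest load is within the contracted staleness."""
    return datetime.now(timezone.utc) - last_loaded_at <= max_staleness

# e.g. a table contracted to at most 60 minutes of staleness
sla = timedelta(minutes=60)
```

The completeness half of the SLA works the same way: compare an observed row count or null rate against the contracted floor, and page the producer (not the data team) on breach.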

Formula

Data Contract Effectiveness = (Schema Changes Caught Pre-Production) ÷ (Total Schema Changes). Mature programs hit >90%. Documentation-only 'contracts' typically catch <10%.
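The formula is a direct ratio; the counts below are hypothetical:

```python
# Data Contract Effectiveness = caught pre-production / total schema changes.
def contract_effectiveness(caught_pre_production: int, total_changes: int) -> float:
    """Fraction of schema changes caught before production (0..1)."""
    return caught_pre_production / total_changes if total_changes else 0.0

print(contract_effectiveness(47, 50))  # 0.94 — in the mature band (>90%)
```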

In Practice

Convoy (the digital freight startup) publicly described their Data Contracts implementation in 2022. Before contracts: dozens of monthly incidents where backend service schema changes silently broke ML models and BI dashboards. After deploying contracts (Protobuf schemas + CI enforcement + a contract registry + named producer ownership): incident rate dropped >70%, producer engineers became aware of downstream impact, and breaking changes required explicit deprecation cycles. Convoy's blog series became the canonical public reference for data contract implementation. (GoCardless, Whatnot, and others followed with similar public accounts.)

Pro Tips

  • 01

    Put data contracts in the producer's repository, not the data team's. The producer must be the owner; if contracts live with consumers, they become a complaints folder. The deploy pipeline of the producing service must enforce them — that's the only thing that changes engineer behavior.

  • 02

    Version every contract from day one (v1, v2). Breaking changes require a new major version with a 60-90 day deprecation window. Without versioning, you'll either ban all changes (impossible) or keep breaking things (contracts become theater).

  • 03

    Pair contracts with a 'data product' framing — the producer is shipping a product (the data feed) to internal consumers. This reframing makes contract ownership feel natural rather than imposed. Without the product framing, producers see contracts as friction; with it, they see them as professionalism.
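Tip 02's versioning rule can be sketched as a small gate a contract registry might run before accepting a new contract version. The 60-day floor and argument names are assumptions chosen to match the tip, not an industry standard:

```python
# Sketch of the policy in tip 02: breaking changes require a new major
# version plus a deprecation window. Thresholds are illustrative.
def change_allowed(breaking: bool, old_major: int, new_major: int,
                   notice_days: int, min_notice_days: int = 60) -> bool:
    """Non-breaking changes always pass; breaking changes need both a
    major-version bump and at least the minimum deprecation notice."""
    if not breaking:
        return True
    return new_major > old_major and notice_days >= min_notice_days
```

With a rule like this in place, "just rename the field" stops being a one-line diff and becomes a v2 with a dated deprecation of v1 — which is exactly the friction the contract is supposed to create.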

Myth vs Reality

Myth

Data contracts are only for streaming/Kafka architectures

Reality

Contracts apply equally to batch warehouse tables, dbt models, API responses, and event streams. The medium varies (Protobuf for events, dbt models with schema tests for warehouse tables, OpenAPI for APIs); the principle is the same — formal schema + semantics + SLA + enforcement. Many of the highest-value contracts are on warehouse tables consumed by ML feature stores.

Myth

Type checking and schema validation = data contract

Reality

Schema is one part. A contract also includes semantic meaning ('user_id' is the auth user, NOT the device user), SLA (freshness, completeness), and a deprecation policy. Schema-only 'contracts' catch type bugs but miss semantic regressions (the field still exists, but its meaning silently changed) — which are the most damaging incidents because they're invisible.
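The 'user_id' example above can be made concrete: a type check accepts any string, so only a semantic guard catches the silent swap from auth ids to device ids. Both id patterns below are hypothetical:

```python
import re

# Schema checks alone would accept any string in 'user_id'; this sketch
# adds a semantic guard for the auth-vs-device regression described above.
AUTH_ID = re.compile(r"^u_\d+$")          # auth user ids, e.g. "u_123"
DEVICE_ID = re.compile(r"^d_[0-9a-f]+$")  # device ids, e.g. "d_9f2c"

def user_id_semantics_ok(user_id: str) -> bool:
    """Both id shapes pass a type check; only this semantic assertion
    notices when the field's meaning silently changes."""
    return bool(AUTH_ID.match(user_id))
```

In practice the semantic layer is usually a distribution or pattern check running alongside schema validation; the point is that it tests meaning, not type.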

Try it


Knowledge Check

A backend engineer renames the field 'user_status' to 'account_status' in their service. Three downstream pipelines and an ML model break overnight. The data team scrambles to fix. What is the right systemic fix?

Industry benchmarks

Schema Change Incident Rate (Pre-Contracts vs. Post) — for mid-to-large engineering orgs with active downstream data dependencies. Use these ranges as calibration targets, not absolutes:

  • Mature contract program: <10% of changes cause incidents
  • Partial coverage: 10-25%
  • Documentation only: 30-50%
  • No discipline: 50-75%
  • Crisis state: >75%

Source: https://medium.com/convoy-tech/the-next-generation-of-data-platforms-is-the-data-mesh-b7df4b825522

Real-world cases

Companies that lived this — verified narratives with the numbers that prove (or break) the concept.

Convoy (2021-2022) — success

Convoy publicly led the data contracts movement after years of incidents from backend schema changes silently breaking ML models and analytics. They implemented Protobuf schemas as contracts, registered in a central registry but owned by producing services, with CI enforcement that failed deploys on breaking changes. Versioning required formal deprecation cycles. Their engineering blog series (Chad Sanderson and team) became the most influential public reference on data contracts and reshaped industry practice in 2022-2023.

  • Schema Format: Protobuf
  • Enforcement: Producer CI
  • Incident Reduction: >70%
  • Public Influence: Standard reference

Producer-owned + CI-enforced contracts work. The cultural shift takes time, but the technical pattern is now well-documented and replicable.

GoCardless (2022-present) — success

GoCardless implemented data contracts to address breaking-change incidents in their financial pipelines, where data accuracy is regulated. They deployed schema-as-code in producing services, enforced via CI, with mandatory semantic descriptions per field. They publicly described the cultural change required: convincing backend engineers that their service's data emissions are part of the product they ship, not exhaust. Outcome: significantly reduced incident volume in regulated financial pipelines and clearer ownership across the data lifecycle.

  • Driver: Regulated financial data accuracy
  • Implementation: Schema-as-code + CI
  • Cultural Change: Backend owns data emissions
  • Outcome: Reduced incident volume

In regulated industries, contracts shift from 'nice to have' to compliance-grade. The producer-as-owner cultural change is the durable win.

Hypothetical: 1,200-person FinTech (2023) — failure

A fintech tried to roll out data contracts top-down: it created a central 'contracts team', built a 200-page governance document, and asked all backend services to submit their schemas for review. Backend engineering leadership refused to let a separate team gate their deploys. After 9 months, only 4 of 80 services had submitted contracts, no enforcement was in place, and incidents continued at the same rate. The contracts team was disbanded; the document sits unread. The technically equivalent pattern (producer-owned + CI-enforced + scoped to the top 5-10 services) would likely have succeeded.

  • Centralized Team Approach: Failed
  • Services Onboarded: 4 of 80
  • Enforcement: None
  • Incident Rate Change: None

Contracts must be producer-owned. Centralized contract teams that gate deploys lose every political battle and produce no enforcement.


Beyond the concept

Turn Data Contracts into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.
