Data Contracts
A Data Contract is a formal, versioned agreement between a data PRODUCER (typically an upstream service, app, or operational system) and DOWNSTREAM CONSUMERS (analytics, ML, ops tools) about the shape, semantics, freshness, and reliability of the data. Concretely: a schema definition + semantic definitions (what each field actually means) + an SLA (freshness, completeness) + a versioning/deprecation policy + automated enforcement (CI checks, runtime validation). Without contracts, every schema change in a producing service silently breaks downstream pipelines, dashboards, and ML models. Data contracts shift the data-quality battle upstream: producers are explicitly accountable for the data their service emits, and breaking changes require a deprecation cycle.
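The pieces above (schema + semantics + SLA + version) can be bundled into a single object. A minimal Python sketch, with hypothetical names like `orders` and `FieldSpec` used purely for illustration, not a production framework:

```python
from dataclasses import dataclass

# Hypothetical minimal contract: schema + semantics + SLA in one versioned object.
@dataclass(frozen=True)
class FieldSpec:
    type: type    # expected Python type of the field
    meaning: str  # semantic definition, not just the type

@dataclass(frozen=True)
class DataContract:
    name: str
    version: str                # e.g. "v1"; breaking changes bump the major
    fields: dict                # field name -> FieldSpec
    freshness_sla_minutes: int  # max staleness consumers can rely on

    def validate(self, record: dict) -> list:
        """Return a list of violations; an empty list means the record conforms."""
        errors = []
        for name, spec in self.fields.items():
            if name not in record:
                errors.append(f"missing field: {name}")
            elif not isinstance(record[name], spec.type):
                errors.append(f"{name}: expected {spec.type.__name__}, "
                              f"got {type(record[name]).__name__}")
        return errors

orders_v1 = DataContract(
    name="orders",
    version="v1",
    fields={
        "user_id": FieldSpec(str, "authenticated user, NOT the device user"),
        "amount_cents": FieldSpec(int, "order total in cents, tax included"),
    },
    freshness_sla_minutes=60,
)

print(orders_v1.validate({"user_id": "u42", "amount_cents": 1999}))    # conforms
print(orders_v1.validate({"user_id": "u42", "amount_cents": "19.99"}))  # type violation
```

The same shape can be expressed in Protobuf, JSON Schema, or Avro; the point is that semantics and SLA travel with the schema instead of living in a separate document.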
The Trap
The trap is treating contracts as documentation. A markdown file describing the schema doesn't prevent anything — engineers ship schema changes anyway, the doc gets stale, and downstream pipelines still break. Real data contracts are enforced in CI: a schema change that breaks the contract fails the producer's deploy. The other trap is over-contracting — formalizing every field of every dataset. The cost is enormous and most fields don't need it. Contracts add value where there is a clear producer/consumer dependency that has caused incidents (typically <15% of datasets). Apply them surgically, not universally.
What to Do
Pick the top 5-10 high-value, high-incident datasets where producer schema changes regularly break downstream consumers. For each: (1) Define the schema explicitly (Protobuf, JSON Schema, Avro, or SQL DDL). (2) Define semantic meaning per field. (3) Define freshness/completeness SLA. (4) Add CI enforcement on the producer side (failing builds for breaking changes). (5) Establish versioning and deprecation policy (e.g., '90-day notice for breaking changes'). Measure incident reduction after 90 days. Expand contracts only where they pay off in incident reduction.
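Step (4), the CI gate, largely reduces to diffing the proposed schema against the last released one and failing on removals or type changes. A hedged sketch, assuming schemas flattened to simple field-name/type maps (not any specific tool's format):

```python
# Hypothetical CI gate: diff the committed schema against the last released one.
# A change is "breaking" if it removes a field or changes a field's type;
# purely additive changes pass.

def breaking_changes(old_schema: dict, new_schema: dict) -> list:
    """old/new map field name -> type name, e.g. {"user_id": "string"}."""
    problems = []
    for name, old_type in old_schema.items():
        if name not in new_schema:
            problems.append(f"removed field: {name}")
        elif new_schema[name] != old_type:
            problems.append(f"type change on {name}: {old_type} -> {new_schema[name]}")
    return problems

released = {"user_id": "string", "user_status": "string"}
proposed = {"user_id": "string", "account_status": "string"}  # a rename = remove + add

issues = breaking_changes(released, proposed)
if issues:  # in a real pipeline this would fail the producer's build
    print("contract violation:", issues)
```

Note that a rename surfaces as a removal, which is exactly the class of change that should trigger the deprecation process rather than ship silently.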
In Practice
Convoy (the digital freight startup) publicly described their Data Contracts implementation in 2022. Before contracts: dozens of monthly incidents where backend service schema changes silently broke ML models and BI dashboards. After deploying contracts (Protobuf schemas + CI enforcement + a contract registry + named producer ownership): incident rate dropped >70%, producer engineers became aware of downstream impact, and breaking changes required explicit deprecation cycles. Convoy's blog series became the canonical public reference for data contract implementation. (GoCardless, Whatnot, and others followed with similar public accounts.)
Pro Tips
- 01
Put data contracts in the producer's repository, not the data team's. The producer must be the owner; if contracts live with consumers, they become a complaints folder. The deploy pipeline of the producing service must enforce them — that's the only thing that changes engineer behavior.
- 02
Version every contract from day one (v1, v2). Breaking changes require a new major version with a 60-90 day deprecation window. Without versioning, you'll either ban all changes (impossible) or keep breaking things (contracts become theater).
- 03
Pair contracts with a 'data product' framing — the producer is shipping a product (the data feed) to internal consumers. This reframing makes contract ownership feel natural rather than imposed. Without the product framing, producers see contracts as friction; with it, they see them as professionalism.
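The versioning tip can be made mechanical. A sketch of a hypothetical registry that refuses a breaking change unless the major version is bumped, and starts a 90-day deprecation clock on the old major (class and method names are illustrative):

```python
from datetime import date, timedelta

# Hypothetical registry rule: a breaking change is only allowed as a new major
# version, and the old major keeps a fixed deprecation window.

DEPRECATION_WINDOW = timedelta(days=90)

class ContractRegistry:
    def __init__(self):
        self.versions = {}       # (name, major) -> schema
        self.deprecations = {}   # (name, major) -> sunset date

    def publish(self, name, major, schema, breaking=False):
        prior = [m for (n, m) in self.versions if n == name]
        if prior and breaking:
            latest = max(prior)
            if major <= latest:
                raise ValueError("breaking change requires a new major version")
            # the old major stays readable until the sunset date
            self.deprecations[(name, latest)] = date.today() + DEPRECATION_WINDOW
        self.versions[(name, major)] = schema

reg = ContractRegistry()
reg.publish("orders", 1, {"user_status": "string"})
reg.publish("orders", 2, {"account_status": "string"}, breaking=True)
print(reg.deprecations[("orders", 1)])  # sunset date for v1
```

Consumers then migrate against a known deadline instead of discovering the change in a broken dashboard.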
Myth vs Reality
Myth
“Data contracts are only for streaming/Kafka architectures”
Reality
Contracts apply equally to batch warehouse tables, dbt models, API responses, and event streams. The medium varies (Protobuf for events, dbt models with schema tests for warehouse tables, OpenAPI for APIs); the principle is the same — formal schema + semantics + SLA + enforcement. Many of the highest-value contracts are on warehouse tables consumed by ML feature stores.
Myth
“Type checking and schema validation = data contract”
Reality
Schema is one part. A contract also includes semantic meaning ('user_id' is the auth user, NOT the device user), SLA (freshness, completeness), and a deprecation policy. Schema-only 'contracts' catch type bugs but miss semantic regressions (the field still exists, but its meaning silently changed) — which are the most damaging incidents because they're invisible.
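Semantic regressions are hard to catch automatically, but one cheap partial guard is to make the field descriptions part of the contract's fingerprint, so editing a field's meaning without a version bump fails CI just like a type change. A sketch under that assumption (structure and names are illustrative):

```python
import hashlib
import json

# Sketch: hash the whole contract, descriptions included, so a silent edit to a
# field's meaning (same name, same type, new semantics) is detected like any
# other contract change.

def fingerprint(contract: dict) -> str:
    canonical = json.dumps(contract, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def check(released: dict, proposed: dict) -> bool:
    """Allow only if nothing changed, or the version was bumped."""
    return (fingerprint(released) == fingerprint(proposed)
            or proposed["version"] > released["version"])

v1 = {
    "version": 1,
    "fields": {"user_id": {"type": "string",
                           "meaning": "authenticated user"}},
}
edited = {
    "version": 1,  # same version...
    "fields": {"user_id": {"type": "string",
                           "meaning": "device user"}},  # ...new meaning
}

print(check(v1, edited))  # False: semantic edit without a version bump
```

This only catches edits to the contract's own description text; drift in what the producer actually emits still needs human review or data-quality monitoring.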
Knowledge Check
A backend engineer renames the field 'user_status' to 'account_status' in their service. Three downstream pipelines and an ML model break overnight. The data team scrambles to fix. What is the right systemic fix?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets — not absolutes.
Schema Change Incident Rate (Pre-Contracts vs Post)
Population: mid-to-large engineering orgs with active downstream data dependencies
- Mature contract program: <10% of changes cause incidents
- Partial coverage: 10-25%
- Documentation only: 30-50%
- No discipline: 50-75%
- Crisis state: >75%
Source: https://medium.com/convoy-tech/the-next-generation-of-data-platforms-is-the-data-mesh-b7df4b825522
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Convoy
2021-2022
Convoy publicly led the data contracts movement after years of incidents from backend schema changes silently breaking ML models and analytics. They implemented Protobuf schemas as contracts, registered in a central registry but owned by producing services, with CI enforcement that failed deploys on breaking changes. Versioning required formal deprecation cycles. Their engineering blog series (Chad Sanderson and team) became the most influential public reference on data contracts and reshaped industry practice in 2022-2023.
- Schema Format: Protobuf
- Enforcement: Producer CI
- Incident Reduction: >70%
- Public Influence: Standard reference
Producer-owned + CI-enforced contracts work. The cultural shift takes time, but the technical pattern is now well-documented and replicable.
GoCardless
2022-present
GoCardless implemented data contracts to address breaking-change incidents in their financial pipelines, where data accuracy is regulated. They deployed schema-as-code in producing services, enforced via CI, with mandatory semantic descriptions per field. They publicly described the cultural change required: convincing backend engineers that their service's data emissions are part of the product they ship, not exhaust. Outcome: significantly reduced incident volume in regulated financial pipelines and clearer ownership across the data lifecycle.
- Driver: Regulated financial data accuracy
- Implementation: Schema-as-code + CI
- Cultural Change: Backend owns data emissions
- Outcome: Reduced incident volume
In regulated industries, contracts shift from 'nice to have' to compliance-grade. The producer-as-owner cultural change is the durable win.
Hypothetical: 1,200-person FinTech
2023
A fintech tried to roll out data contracts top-down: it created a central 'contracts team', built a 200-page governance document, and asked all backend services to submit their schemas for review. Backend engineering leadership refused to let a separate team gate their deploys. After 9 months, only 4 services had submitted contracts, no enforcement was in place, and incidents continued at the same rate. The contracts team was disbanded; the document sits unread. The technically equivalent pattern (producer-owned + CI-enforced + scoped to the top 5-10 services) would likely have succeeded.
- Centralized Team Approach: Failed
- Services Onboarded: 4 of 80
- Enforcement: None
- Incident Rate Change: None
Contracts must be producer-owned. Centralized contract teams that gate deploys lose every political battle and produce no enforcement.
Beyond the concept
Turn Data Contracts into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.