Data Catalog
A Data Catalog is the searchable inventory of every meaningful dataset in the company — table by table, column by column — enriched with business context (owner, definition, freshness, quality, lineage, certification). Modern catalogs (Atlan, Alation, Collibra, Microsoft Purview, OpenMetadata) auto-crawl warehouses, BI tools, and pipelines to keep metadata current, then layer on Google-style search so an analyst can type 'monthly active users' and find the certified table in 5 seconds instead of asking in Slack and waiting 2 days. The honest test of a catalog is not 'do we have one' but 'when an analyst joins, do they self-serve discovery, or still pull a senior engineer into a 30-minute Loom?' If it's the latter, the catalog is a wiki, not a catalog.
The Trap
The trap is treating the catalog as a documentation project — a one-time push to fill in 5,000 table descriptions, declare victory, then watch the metadata rot within 90 days as schemas drift. Manually maintained catalogs always lose to entropy. The second trap is buying a $200K/year enterprise catalog and giving it to a 2-person 'metadata team' to populate by hand; you've bought a Ferrari and put it in a garage. The third trap, KnowMBA POV: companies obsess over coverage breadth (every table cataloged) when consumption depth (analysts actually using it weekly) is the only metric that matters. A catalog with 100 tables that 80% of analysts use weekly beats a catalog with 10,000 tables that 5% of analysts have ever opened.
What to Do
Start narrow.
Step 1: Pick a tool with auto-crawling and SQL parsing (Atlan, OpenMetadata, or Microsoft Purview if you're on Azure).
Step 2: Launch with the top 50-100 production tables, the ones used in 80% of dashboards.
Step 3: Enforce the 'certified dataset' workflow: a dataset is searchable as 'certified' only after it has a named owner, a definition, a freshness SLA, and quality checks.
Step 4: Integrate catalog search into the analyst workflow: Slack bot, BI tool sidebar, IDE plugin.
Step 5: Publish weekly metrics: search queries, certification coverage, and % of new questions answered without escalation.
Step 6: Deprecate non-certified datasets that are still queried, forcing the team to either certify them or migrate consumers off.
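The Step 3 gate is simple enough to express in code. Here is a minimal sketch, assuming a hypothetical `DatasetMetadata` record; the field names are illustrative and not taken from any specific catalog's API. A dataset is certifiable only when all four requirements are present.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record; field names are illustrative."""
    name: str
    owner: Optional[str] = None
    definition: Optional[str] = None
    freshness_sla_hours: Optional[int] = None
    quality_checks: Tuple[str, ...] = ()


def missing_certification_fields(ds: DatasetMetadata) -> List[str]:
    """Return the Step 3 requirements this dataset does not yet meet."""
    missing = []
    if not ds.owner:
        missing.append("owner")
    if not ds.definition:
        missing.append("definition")
    if ds.freshness_sla_hours is None:
        missing.append("freshness_sla")
    if not ds.quality_checks:
        missing.append("quality_checks")
    return missing


def is_certifiable(ds: DatasetMetadata) -> bool:
    """A dataset can be marked 'certified' only when nothing is missing."""
    return not missing_certification_fields(ds)


# Illustrative records (made-up names and values).
revenue = DatasetMetadata(
    name="analytics.customer_revenue",
    owner="finance-data",
    definition="Recognized revenue per customer per month",
    freshness_sla_hours=24,
    quality_checks=("not_null:customer_id",),
)
draft = DatasetMetadata(name="scratch.tmp_revenue")

print(is_certifiable(revenue))
print(missing_certification_fields(draft))
```

Returning the list of missing fields, rather than a bare yes/no, gives the dataset owner a to-do list instead of a rejection.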
In Practice
Atlan is the canonical case study. Its customer base (Postman, Plaid, HubSpot, and hundreds of mid-to-large data orgs) treats the catalog as the 'control plane' for the data team. Atlan reports that customers typically see analyst self-serve rates rise 4-7x within 6 months once catalog search is integrated into Slack and the BI tool. The decisive change is workflow integration: when 'where is the customer revenue table?' gets answered by typing into Slack instead of pinging a senior engineer, the catalog becomes a habit. When it requires opening a separate UI, it dies.
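To make the Slack scenario concrete, here is a rough sketch of the kind of lookup a chat bot could call behind the scenes. It is not Atlan's search or any vendor's API: `CatalogEntry`, `search_catalog`, and the sample tables are hypothetical, and the ranking is a naive keyword overlap that puts certified tables first.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record used only for this sketch."""
    table: str
    description: str
    certified: bool


def search_catalog(query: str, entries: List[CatalogEntry], limit: int = 3) -> List[CatalogEntry]:
    """Rank entries by certification first, then by keyword overlap."""
    terms = set(query.lower().split())
    scored = []
    for entry in entries:
        text = f"{entry.table} {entry.description}".lower()
        words = set(text.replace("_", " ").replace(".", " ").split())
        overlap = len(terms & words)
        if overlap:
            scored.append((entry.certified, overlap, entry))
    # Certified tables sort ahead of uncertified ones; ties break on overlap.
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [entry for _, _, entry in scored[:limit]]


entries = [
    CatalogEntry("scratch.revenue_tmp_v2", "Ad hoc customer revenue export", False),
    CatalogEntry("analytics.customer_revenue", "Monthly revenue per customer", True),
    CatalogEntry("analytics.monthly_active_users", "Monthly active users by plan", True),
]

results = search_catalog("customer revenue table", entries)
for entry in results:
    print(entry.table, "(certified)" if entry.certified else "(not certified)")
```

The design choice worth noting is the sort key: certification outranks text relevance, so the trusted table is what the analyst sees first in the channel.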
Pro Tips
- 01
Catalog adoption is a Slack-bot problem, not a UI problem. Analysts don't context-switch to a separate tool to answer a question; they ask in Slack. The catalog that wins is the one whose search results show up directly in the channel where the question was asked.
- 02
Measure 'time-to-first-trusted-table' for new analyst hires. Before catalog: 3-6 weeks of asking around. After good catalog: 2-3 days. This single metric justifies most catalog investments to a CFO without needing soft ROI math.
- 03
Auto-extract everything you can (table names, column types, SQL lineage, dbt descriptions, BI usage stats) before asking humans to write descriptions. Human-written metadata should be the last 20%, not the first 80%. Reverse this and the project will collapse under its own weight.
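The 'time-to-first-trusted-table' metric from tip 02 can be computed from two dates per hire. This sketch assumes a definition the tip does not spell out: days between an analyst's start date and their first query against a certified table. The hire names and dates are made up for illustration.

```python
from datetime import date
from statistics import median


def days_to_first_trusted_table(start_date: date, first_certified_query: date) -> int:
    """Days from an analyst's start date to their first certified-table query."""
    return (first_certified_query - start_date).days


# Hypothetical new hires: (start date, date of first query on a certified table).
new_hires = {
    "analyst_a": (date(2024, 1, 8), date(2024, 1, 10)),
    "analyst_b": (date(2024, 1, 15), date(2024, 1, 18)),
    "analyst_c": (date(2024, 2, 5), date(2024, 2, 7)),
}

per_hire = {
    name: days_to_first_trusted_table(start, first_query)
    for name, (start, first_query) in new_hires.items()
}
median_days = median(per_hire.values())

print(per_hire)
print("median days:", median_days)
```

Reporting the median rather than the mean keeps one slow onboarding from distorting the number shown to the CFO.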
Myth vs Reality
Myth
“We need a data catalog before we do anything else with our data”
Reality
If you have under 50 production tables and a 5-person data team that knows them all, a catalog is overhead. Catalogs become essential around 100+ tables and 20+ data consumers — the tipping point where tribal knowledge breaks down. Premature catalog adoption is a top-3 cause of failed metadata initiatives.
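The thresholds above can be written as a rough screening rule. This is only a sketch: the function name is hypothetical, the text does not say what happens between the two thresholds (treated here as a judgment call), and "a team that knows them all" cannot be measured, so team size stands in for it.

```python
def catalog_need(production_tables: int, data_consumers: int, data_team_size: int) -> str:
    """Rough screen based on the thresholds in the text; not a hard rule."""
    if production_tables < 50 and data_team_size <= 5:
        return "overhead"
    if production_tables >= 100 and data_consumers >= 20:
        return "essential"
    return "judgment call"


# Illustrative inputs; the middle case uses ~1,800 tables and ~250 consumers
# with an assumed team size of 30.
print(catalog_need(production_tables=40, data_consumers=12, data_team_size=5))
print(catalog_need(production_tables=1800, data_consumers=250, data_team_size=30))
print(catalog_need(production_tables=70, data_consumers=15, data_team_size=8))
```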
Myth
“A wiki (Confluence, Notion) is a cheap data catalog”
Reality
Wikis are static. Schemas change weekly. Within 90 days, every Confluence data dictionary is wrong about 30%+ of its content, and analysts learn to distrust it. A real catalog auto-syncs metadata from the warehouse on every change. The cost difference between a wiki and a catalog is real, but so is the trust difference — and trust is what makes the catalog get used.
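Wiki drift is easy to measure once you can list the columns that actually exist. As a sketch under stated assumptions: `documented` stands for what a wiki data dictionary claims, `live` for what the warehouse reports today (for example from its information schema), and both are hard-coded here with made-up tables.

```python
from typing import Dict, Set


def find_drift(documented: Dict[str, Set[str]], live: Dict[str, Set[str]]) -> Dict[str, Dict[str, Set[str]]]:
    """Return, per documented table, columns the wiki gets wrong."""
    drift = {}
    for table, doc_cols in documented.items():
        live_cols = live.get(table)
        if live_cols is None:
            # Table was dropped or renamed; every documented column is stale.
            drift[table] = {"missing_in_warehouse": set(doc_cols), "undocumented": set()}
            continue
        removed = doc_cols - live_cols
        added = live_cols - doc_cols
        if removed or added:
            drift[table] = {"missing_in_warehouse": removed, "undocumented": added}
    return drift


# Hypothetical wiki contents versus current warehouse schema.
documented = {
    "orders": {"id", "amount", "status"},
    "customers": {"id", "email"},
    "events": {"id", "ts"},
}
live = {
    "orders": {"id", "amount_usd", "status"},  # column renamed since the wiki was written
    "customers": {"id", "email"},
    # "events" no longer exists
}

drift = find_drift(documented, live)
stale_share = len(drift) / len(documented)
print(sorted(drift), f"{stale_share:.0%} of documented tables are stale")
```

An auto-syncing catalog effectively runs this comparison continuously, which is why it stays trustworthy where a static page does not.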
Knowledge Check
A 600-person company bought an enterprise data catalog 12 months ago. They spent $180K on the license and 6 person-months filling in metadata. Today the catalog has ~15 weekly active users (mostly the data team itself). Analysts still ask in Slack 'which table has revenue?' What is the most likely root cause?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets — not absolutes.
Catalog Adoption (% of data consumers active weekly)
Mid-to-large enterprises (Atlan, Alation, OpenMetadata customer benchmarks)
Best-in-class (workflow-integrated)
60-80%+
Good
30-60%
Average
15-30%
Shelfware
<15%
Source: https://atlan.com/active-metadata/
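To place your own number in these tiers, divide weekly active catalog users by total data consumers. A small sketch follows; the tier ranges share their endpoints (30% appears in both Good and Average), so this version assumes each lower bound is inclusive. The function name and sample counts are illustrative.

```python
def adoption_tier(weekly_active: int, data_consumers: int) -> str:
    """Map weekly adoption rate to the benchmark tiers above."""
    if data_consumers <= 0:
        raise ValueError("data_consumers must be positive")
    rate = weekly_active / data_consumers
    if rate >= 0.60:
        return "Best-in-class"
    if rate >= 0.30:
        return "Good"
    if rate >= 0.15:
        return "Average"
    return "Shelfware"


# Illustrative counts for an organization with 250 data consumers.
for weekly_active in (160, 40, 11):
    print(weekly_active, adoption_tier(weekly_active, 250))
```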
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Atlan
2020-present
Atlan built the active metadata category by treating the catalog as a workflow tool, with deep integrations into Slack, dbt, Snowflake, BigQuery, Looker, Tableau, and Mode. Customers like Postman, Plaid, and HubSpot report that catalog-driven self-serve rates rise 4-7x within 6 months once Slack and BI integrations are live. The decisive product insight: analysts don't change tools to find data; they search where they already work. Atlan's growth (Series C, $750M+ valuation) is essentially proof that workflow integration is the catalog category's only durable differentiator.
Customer Self-Serve Lift
4-7x in 6 months
Primary Differentiator
Slack + dbt + BI workflow integration
Funding
Series C, $750M+ valuation
Notable Customers
Postman, Plaid, HubSpot, Autodesk
The catalog category is being won by workflow integration, not metadata depth. Buy the catalog that integrates with the tools your analysts already live in.
Microsoft Purview
2021-present
Microsoft Purview (formerly Azure Purview) became the de facto enterprise catalog for Azure-heavy organizations by being bundled with E5 licenses and integrating natively with Synapse, Fabric, and Power BI. Adoption has been strong in regulated industries (financial services, healthcare, public sector) where Microsoft is already the strategic vendor. The trade-off is breadth (full Microsoft stack coverage) vs depth (less polished UI than dedicated competitors like Atlan). For organizations standardized on Microsoft, the bundle economics are nearly impossible for a standalone catalog to beat.
Native Integrations
Synapse, Fabric, Power BI, Purview governance
Bundling
Included in E5 / paid Azure tiers
Sweet Spot
Regulated, Microsoft-standardized enterprises
Key Trade-off
Breadth vs UX polish
Catalog selection often comes down to ecosystem alignment more than feature parity. The 'best' catalog is usually the one that comes with the warehouse and BI stack you've already standardized on.
Hypothetical: Mid-Market Retailer
2022-2023
A 600-person retailer bought Collibra for $250K/year and assigned a 2-person metadata team to populate it. After 14 months: 8,000 tables cataloged, weekly active users averaged 11 (mostly the metadata team itself). There was no Slack integration, no BI sidebar, and no dbt linkage, so analysts continued asking in Slack. The CFO declined to fund the renewal. Total spent: $290K including services, with no measurable change in analyst productivity. The lesson the team wrote up internally: 'we bought a catalog, not a workflow.'
Annual License + Services
$290K
Tables Cataloged
8,000
Weekly Active Users
~11
Workflow Integration
None
Coverage is the vanity metric of data catalogs. Adoption is the real one. A 200-table catalog used by 80% of analysts beats an 8,000-table catalog used by 5%.
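The gap between the two metrics shows up clearly in unit costs. Using only the figures from this hypothetical case ($290K spent, 8,000 tables, ~11 weekly active users), a quick calculation; the function name is illustrative.

```python
def unit_costs(total_spend: float, tables_cataloged: int, weekly_active_users: int) -> dict:
    """Cost per cataloged table (coverage view) vs per weekly active user (adoption view)."""
    return {
        "cost_per_table": total_spend / tables_cataloged,
        "cost_per_weekly_active_user": total_spend / weekly_active_users,
    }


costs = unit_costs(total_spend=290_000, tables_cataloged=8_000, weekly_active_users=11)
print(f"per table: ${costs['cost_per_table']:,.2f}")
print(f"per weekly active user: ${costs['cost_per_weekly_active_user']:,.0f}")
```

Measured by coverage, the rollout looks cheap at about $36 per table. Measured by adoption, it cost roughly $26K per person who actually used it.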
Decision scenario
Choosing the Catalog That Sticks
You're VP of Data at a 1,200-person SaaS company. Snowflake-based stack, dbt for transformations, Looker for BI, ~250 analysts and engineers consuming data. The CFO has approved $250K/year for a catalog. You're choosing between Atlan ($220K with deep Slack + dbt + Looker integration), Microsoft Purview ($60K incremental on existing E5, but weaker non-Microsoft integrations), and OpenMetadata (open source, $0 license but ~$200K of integration engineering).
Data Consumers
~250
Production Tables
~1,800
Approved Budget
$250K/year
Stack
Snowflake + dbt + Looker + Slack
Current State
Tribal knowledge + Confluence
Decision 1
All three are technically capable. The decision is really about adoption mechanics. You have one shot — a failed catalog rollout typically poisons the well for 2-3 years before the org will fund another attempt.
OpenMetadata. Free license, full control, build the integrations in-house with engineering time.
Microsoft Purview. Cheap incremental cost, leverages existing E5 license.
Atlan. Higher license cost but deep native integrations with Snowflake, dbt, Looker, and Slack, the exact stack your analysts use. (✓ Optimal)