KnowMBA Advisory
Data Strategy · Intermediate · 7 min read

Data Catalog

A Data Catalog is the searchable inventory of every meaningful dataset in the company — table by table, column by column — enriched with business context (owner, definition, freshness, quality, lineage, certification). Modern catalogs (Atlan, Alation, Collibra, Microsoft Purview, OpenMetadata) auto-crawl warehouses, BI tools, and pipelines to keep metadata current, then layer on Google-style search so an analyst can type 'monthly active users' and find the certified table in 5 seconds instead of asking in Slack and waiting 2 days. The honest test of a catalog is not 'do we have one' but 'when an analyst joins, do they self-serve discovery, or still pull a senior engineer into a 30-minute Loom?' If it's the latter, the catalog is a wiki, not a catalog.

Also known as: Active Metadata Platform, Data Discovery Catalog, Enterprise Data Catalog, Metadata Management, Data Inventory

The Trap

The trap is treating the catalog as a documentation project — a one-time push to fill in 5,000 table descriptions, declare victory, then watch the metadata rot within 90 days as schemas drift. Manually maintained catalogs always lose to entropy. The second trap is buying a $200K/year enterprise catalog and giving it to a 2-person 'metadata team' to populate by hand; you've bought a Ferrari and put it in a garage. The third trap, KnowMBA POV: companies obsess over coverage breadth (every table cataloged) when consumption depth (analysts actually using it weekly) is the only metric that matters. A catalog with 100 tables that 80% of analysts use weekly beats a catalog with 10,000 tables that 5% of analysts have ever opened.

What to Do

Start narrow.

Step 1: Pick a tool with auto-crawling and SQL parsing (Atlan, OpenMetadata, or Microsoft Purview if you're on Azure).

Step 2: Launch with the top 50-100 production tables, the ones that appear in 80% of dashboards.

Step 3: Enforce the 'certified dataset' workflow: a dataset appears in search as 'certified' only after it has a named owner, a definition, a freshness SLA, and quality checks.

Step 4: Integrate catalog search into the analyst workflow: Slack bot, BI tool sidebar, IDE plugin.

Step 5: Publish weekly metrics: search queries, certification coverage, and the % of new questions answered without escalation.

Step 6: Deprecate non-certified datasets that are still queried, forcing the team to either certify them or migrate consumers off.
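The certification gate in Step 3 can be sketched in a few lines. This is a minimal illustration, not any vendor's API: the `Dataset` record, its field names, and the example table names are all hypothetical, and a real catalog would expose its own metadata model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    # Hypothetical record shape, for illustration only.
    name: str
    owner: Optional[str] = None
    definition: Optional[str] = None
    freshness_sla_hours: Optional[int] = None
    quality_checks: List[str] = field(default_factory=list)


def missing_for_certification(ds: Dataset) -> List[str]:
    """Return the certification requirements a dataset still lacks."""
    missing = []
    if not ds.owner:
        missing.append("owner")
    if not ds.definition:
        missing.append("definition")
    if ds.freshness_sla_hours is None:
        missing.append("freshness SLA")
    if not ds.quality_checks:
        missing.append("quality checks")
    return missing


def is_certified(ds: Dataset) -> bool:
    """A dataset is certified only when nothing is missing."""
    return not missing_for_certification(ds)


revenue = Dataset(
    name="analytics.customer_revenue",
    owner="finance-data@example.com",
    definition="Recognized revenue per customer per month.",
    freshness_sla_hours=24,
    quality_checks=["not_null(customer_id)", "unique(customer_id, month)"],
)
scratch = Dataset(name="tmp.revenue_v2_final", owner="unknown-analyst")

print(is_certified(revenue))               # True
print(missing_for_certification(scratch))  # ['definition', 'freshness SLA', 'quality checks']
```

Returning the list of missing items, rather than a bare yes/no, gives the owner a concrete to-do list, which is what makes Step 6 (certify or migrate) enforceable.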

Formula

Catalog Value = Coverage of Critical Datasets × Workflow Integration × Freshness × Trust (% certified). All four must be > 0; the lowest term caps the value.
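As a rough illustration of why the lowest term caps the value, the sketch below multiplies the four terms. Scoring each term as a fraction between 0 and 1 is an assumption made here; the formula itself does not fix a scale, and the example scores are invented.

```python
def catalog_value(coverage: float, integration: float,
                  freshness: float, trust: float) -> float:
    """Multiply the four terms, each scored as a fraction between 0 and 1."""
    terms = (coverage, integration, freshness, trust)
    for term in terms:
        if not 0.0 <= term <= 1.0:
            raise ValueError("each term must be a fraction between 0 and 1")
    value = 1.0
    for term in terms:
        value *= term
    return value


# Illustrative scores only: near-complete coverage, almost no workflow integration.
broad_but_unused = catalog_value(0.95, 0.10, 0.90, 0.80)

# Illustrative scores only: modest on every dimension.
balanced = catalog_value(0.70, 0.70, 0.70, 0.70)

print(round(broad_but_unused, 4))  # 0.0684
print(round(balanced, 4))          # 0.2401
```

Because every term is at most 1, the product can never exceed the smallest term, so raising coverage from 0.95 to 1.0 does almost nothing while integration sits at 0.10.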

In Practice

Atlan is the canonical case study. Its customer base (Postman, Plaid, HubSpot, and hundreds of mid-to-large data orgs) treats the catalog as the 'control plane' for the data team. Atlan publishes that customers typically reach a 4-7x increase in analyst self-serve rates within 6 months once catalog search is integrated into Slack and the BI tool. The decisive change is workflow integration: when 'where is the customer revenue table?' gets answered by typing into Slack instead of pinging a senior engineer, the catalog becomes a habit. When it requires opening a separate UI, it dies.

Pro Tips

  • 01

    Catalog adoption is a Slack-bot problem, not a UI problem. Analysts don't context-switch to a separate tool to answer a question; they ask in Slack. The catalog that wins is the one whose search results show up directly in the channel where the question was asked.

  • 02

    Measure 'time-to-first-trusted-table' for new analyst hires. Before a catalog: 3-6 weeks of asking around. After a good catalog: 2-3 days. This single metric justifies most catalog investments to a CFO without needing soft ROI math.

  • 03

    Auto-extract everything you can (table names, column types, SQL lineage, dbt descriptions, BI usage stats) before asking humans to write descriptions. Human-written metadata should be the last 20%, not the first 80%. Reverse this and the project will collapse under its own weight.
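The 'time-to-first-trusted-table' metric from tip 02 could be computed along these lines. This is a sketch under assumptions: the records are invented, and it presumes you can join HR start dates to warehouse query logs to find each hire's first query against a certified table.

```python
from datetime import date
from statistics import median

# Hypothetical records: (analyst, start date, date of first query on a certified table).
hires = [
    ("analyst_a", date(2024, 3, 4), date(2024, 3, 6)),
    ("analyst_b", date(2024, 3, 11), date(2024, 3, 14)),
    ("analyst_c", date(2024, 4, 1), date(2024, 4, 3)),
]


def days_to_first_trusted_table(start: date, first_certified_query: date) -> int:
    """Calendar days between a hire's start and their first certified-table query."""
    return (first_certified_query - start).days


ramp_days = [days_to_first_trusted_table(start, first) for _, start, first in hires]
median_ramp = median(ramp_days)

print(ramp_days)    # [2, 3, 2]
print(median_ramp)  # 2
```

The median is used instead of the mean so that one hire who takes a long time to run a first query does not distort the headline number.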

Myth vs Reality

Myth

We need a data catalog before we do anything else with our data

Reality

If you have under 50 production tables and a 5-person data team that knows them all, a catalog is overhead. Catalogs become essential around 100+ tables and 20+ data consumers — the tipping point where tribal knowledge breaks down. Premature catalog adoption is a top-3 cause of failed metadata initiatives.

Myth

A wiki (Confluence, Notion) is a cheap data catalog

Reality

Wikis are static. Schemas change weekly. Within 90 days, every Confluence data dictionary is wrong about 30%+ of its content, and analysts learn to distrust it. A real catalog auto-syncs metadata from the warehouse on every change. The cost difference between a wiki and a catalog is real, but so is the trust difference — and trust is what makes the catalog get used.
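To make the drift problem concrete, the sketch below compares a hand-written data dictionary against the live schema. The column names and types are invented, and `schema_drift` is a hypothetical helper, not a function from any catalog product; in practice the live side would come from the warehouse's metadata tables.

```python
from typing import Dict, List


def schema_drift(documented: Dict[str, str], live: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare documented column types against the live schema."""
    shared = set(documented) & set(live)
    return {
        "added": sorted(set(live) - set(documented)),
        "removed": sorted(set(documented) - set(live)),
        "retyped": sorted(col for col in shared if documented[col] != live[col]),
    }


# Hypothetical wiki entry versus what the warehouse currently holds.
documented = {"customer_id": "INTEGER", "mrr": "FLOAT", "plan": "VARCHAR"}
live = {"customer_id": "VARCHAR", "mrr": "FLOAT",
        "plan_tier": "VARCHAR", "region": "VARCHAR"}

drift = schema_drift(documented, live)
# Share of documented columns that are now wrong (removed or retyped).
stale_share = (len(drift["removed"]) + len(drift["retyped"])) / len(documented)

print(drift)
print(round(stale_share, 2))  # 0.67
```

A wiki has no trigger to run a check like this; an auto-syncing catalog effectively runs it on every crawl.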


Knowledge Check

A 600-person company bought an enterprise data catalog 12 months ago. They spent $180K on the license and 6 person-months filling in metadata. Today, the catalog has ~15 weekly active users (mostly the data team itself). Analysts still ask in Slack 'which table has revenue?' What is the most likely root cause?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets — not absolutes.

Catalog Adoption (% of data consumers active weekly)

Mid-to-large enterprises (Atlan, Alation, OpenMetadata customer benchmarks)

Best-in-class (workflow-integrated): 60-80%+

Good: 30-60%

Average: 15-30%

Shelfware: <15%

Source: https://atlan.com/active-metadata/
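To place your own number against the adoption tiers above, a small helper like the following may be enough. Treating each range's lower bound as inclusive is an assumption, since the published ranges overlap at 30% and 60%; the function name and example counts are illustrative.

```python
def adoption_tier(weekly_active_users: int, data_consumers: int) -> str:
    """Map weekly catalog adoption onto the benchmark tiers."""
    if data_consumers <= 0:
        raise ValueError("data_consumers must be positive")
    pct = 100 * weekly_active_users / data_consumers
    # Assumption: lower bounds are inclusive (30% counts as Good, 60% as Best-in-class).
    if pct >= 60:
        return "Best-in-class"
    if pct >= 30:
        return "Good"
    if pct >= 15:
        return "Average"
    return "Shelfware"


print(adoption_tier(163, 250))  # Best-in-class
print(adoption_tier(75, 250))   # Good
print(adoption_tier(30, 250))   # Shelfware
```

The denominator is data consumers, not total headcount, which matches how the benchmark is defined (% of data consumers active weekly).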

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.

Atlan (2020-present) · Outcome: success

Atlan built the active metadata category by treating the catalog as a workflow tool, with deep integrations into Slack, dbt, Snowflake, BigQuery, Looker, Tableau, and Mode. Customers like Postman, Plaid, and HubSpot publish that catalog-driven self-serve rates rise 4-7x within 6 months once Slack and BI integrations are live. The decisive product insight: analysts don't change tools to find data; they search where they already work. Atlan's growth (Series C, $750M+ valuation) is essentially proof that workflow integration is the catalog category's only durable differentiator.

Customer Self-Serve Lift: 4-7x in 6 months

Primary Differentiator: Slack + dbt + BI workflow integration

Funding: Series C, $750M+ valuation

Notable Customers: Postman, Plaid, HubSpot, Autodesk

The catalog category is being won by workflow integration, not metadata depth. Buy the catalog that integrates with the tools your analysts already live in.


Microsoft Purview (2021-present) · Outcome: success

Microsoft Purview (formerly Azure Purview) became the de facto enterprise catalog for Azure-heavy organizations by being bundled with E5 licenses and integrating natively with Synapse, Fabric, and Power BI. Adoption has been strong in regulated industries (financial services, healthcare, public sector) where Microsoft is already the strategic vendor. The trade-off is breadth (full Microsoft stack coverage) vs depth (less polished UI than dedicated competitors like Atlan). For organizations standardized on Microsoft, the bundle economics are nearly impossible for a standalone catalog to beat.

Native Integrations: Synapse, Fabric, Power BI, Purview governance

Bundling: Included in E5 / paid Azure tiers

Sweet Spot: Regulated, Microsoft-standardized enterprises

Key Trade-off: Breadth vs UX polish

Catalog selection often comes down to ecosystem alignment more than feature parity. The 'best' catalog is usually the one that comes with the warehouse and BI stack you've already standardized on.


Hypothetical: Mid-Market Retailer (2022-2023) · Outcome: failure

A 600-person retailer bought Collibra for $250K/year and assigned a 2-person metadata team to populate it. After 14 months: 8,000 tables cataloged, weekly active users averaged 11 (mostly the metadata team itself). No Slack integration, no BI sidebar, no dbt linkage — analysts continued asking in Slack. The CFO de-funded renewal. Total spent: $290K including services, with no measurable change in analyst productivity. The lesson the team wrote up internally: 'we bought a catalog, not a workflow.'

Annual License + Services: $290K

Tables Cataloged: 8,000

Weekly Active Users: ~11

Workflow Integration: None

Coverage is the vanity metric of data catalogs. Adoption is the real one. A 200-table catalog used by 80% of analysts beats an 8,000-table catalog used by 5%.

Decision scenario

Choosing the Catalog That Sticks

You're VP of Data at a 1,200-person SaaS company. Snowflake-based stack, dbt for transformations, Looker for BI, ~250 analysts and engineers consuming data. The CFO has approved $250K/year for a catalog. You're choosing between Atlan ($220K with deep Slack + dbt + Looker integration), Microsoft Purview ($60K incremental on existing E5, but weaker non-Microsoft integrations), and OpenMetadata (open source, $0 license but ~$200K of integration engineering).

Data Consumers: ~250

Production Tables: ~1,800

Approved Budget: $250K/year

Stack: Snowflake + dbt + Looker + Slack

Current State: Tribal knowledge + Confluence

01

Decision 1

All three are technically capable. The decision is really about adoption mechanics. You have one shot — a failed catalog rollout typically poisons the well for 2-3 years before the org will fund another attempt.

OpenMetadata. Free license, full control, build the integrations in-house with engineering time.
By month 9, OpenMetadata is deployed with auto-crawl across Snowflake and dbt. Slack bot is half-built but the engineer who started it left. Looker integration is 'on the roadmap'. Coverage is great; adoption is ~12% weekly. CFO asks about the $200K of engineering time; you can't show ROI. Renewal of the engineering investment is denied. The deployment limps along as a metadata reference for the data team only. Open source was technically free and operationally expensive.
Weekly Active Users: 12% of consumers · Total Cost (incl. eng): $200K+ with no ROI

Microsoft Purview. Cheap incremental cost, leverages existing E5 license.
Purview deploys cleanly with Azure-native services but the integrations with Snowflake, dbt, and Looker are clunky — Microsoft prioritizes Fabric and Power BI. Slack integration doesn't exist out of the box. By month 6, the catalog has crawled metadata but adoption mirrors Confluence (~15% weekly). The cost was low so there's no political fallout, but you've also gotten near-zero value. The catalog becomes a checkbox for compliance auditors, nothing more.
Weekly Active Users: ~15% of consumers · Outcome: Compliance checkbox, not productivity tool

Atlan. Higher license cost but deep native integrations with Snowflake, dbt, Looker, and Slack, the exact stack your analysts use.
Months 1-3: certify top 100 tables with domain leads (painful but bounded). Month 4: launch Slack bot, dbt deep-link integration, Looker sidebar. Month 6: Slack bot answers ~60 dataset questions per week, replacing what used to be senior engineer time. Weekly active users hit ~65% of analysts by month 9. Time-to-first-trusted-table for new hires drops from 3 weeks to 4 days. CFO sees clear ROI: ~$1.8M in productivity recovery on a $220K investment. Budget approved for expansion to data quality and contracts.
Weekly Active Users: ~65% of consumers · New-Hire Ramp: 3 weeks → 4 days · ROI: ~8x in year 1
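The scenario's numbers can be checked with simple arithmetic. The sketch below uses only figures stated in the scenario (cost, weekly adoption, ~250 consumers, ~$1.8M productivity recovery); cost per weekly active user is a derived comparison added here for illustration, not a metric the scenario itself reports.

```python
DATA_CONSUMERS = 250  # from the scenario

# Cost and weekly adoption for each option, as stated in the scenario outcomes.
options = {
    "OpenMetadata": {"cost": 200_000, "weekly_active_pct": 12},
    "Microsoft Purview": {"cost": 60_000, "weekly_active_pct": 15},
    "Atlan": {"cost": 220_000, "weekly_active_pct": 65},
}


def cost_per_weekly_active_user(cost: float, weekly_active_pct: float,
                                consumers: int = DATA_CONSUMERS) -> float:
    """Spend divided by the number of consumers active in a given week."""
    weekly_active_users = weekly_active_pct * consumers / 100
    return cost / weekly_active_users


cost_per_wau = {name: cost_per_weekly_active_user(**o) for name, o in options.items()}
cheapest_per_active_user = min(cost_per_wau, key=cost_per_wau.get)

# ROI multiple for the Atlan path: ~$1.8M recovered on a $220K investment.
roi_multiple = 1_800_000 / 220_000

for name, value in cost_per_wau.items():
    print(f"{name}: ${value:,.0f} per weekly active user")
print(cheapest_per_active_user)
print(round(roi_multiple, 1))
```

On these figures the option with the highest sticker price has the lowest cost per person actually using the catalog, which is the scenario's point about adoption mechanics.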

