AB Testing Platform
An AB testing platform is the technical and statistical infrastructure that lets product, growth, and marketing teams ship controlled experiments: randomly assigning users to variants, measuring outcomes, and deciding whether a change is a winner. The defining components: (1) a randomization service (assigns users deterministically), (2) feature flag delivery (toggles variants in client and server code), (3) event ingestion and experiment computation (measures the outcome metrics), (4) a statistical engine (frequentist or Bayesian inference, sequential tests, CUPED variance reduction), and (5) an experimentation portal (UX for designing, launching, monitoring, and deciding). The dominant commercial platforms (Optimizely, AB Tasty, Statsig, GrowthBook, Eppo, LaunchDarkly Experiments) differ in how they split effort between feature flagging and statistical sophistication. Big tech (Google, Microsoft, Meta, Netflix, Booking.com) built their own; mid-market and growth-stage companies overwhelmingly buy.
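A minimal sketch of component (1), deterministic randomization, assuming a hash-based bucketing scheme. The function name, bucket count, and variant names are illustrative, not any specific vendor's API:

```python
import hashlib

def assign_variant(user_id: str, experiment_key: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant by hashing user + experiment.

    The same user always lands in the same variant for a given experiment,
    with no assignment table required; different experiments hash
    independently, so assignments don't correlate across experiments.
    """
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 1000  # 1,000 buckets = 0.1% rollout granularity
    return variants[0] if bucket < 500 else variants[1]

# Stable across calls, roughly 50/50 across users:
assert assign_variant("user-42", "new-onboarding") == assign_variant("user-42", "new-onboarding")
```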
The Trap
The trap is buying an experimentation platform expecting it to fix a culture that doesn't actually want to learn from experiments. Most companies launch the platform, run 12 experiments in year one, declare 8 of them "winners" through eyeball analysis (ignoring the platform's statistics), and quietly stop using it. The honest precondition for an experimentation platform is a culture that accepts losing experiments and ships only the actual winners, which is a higher bar than most leadership teams actually meet. KnowMBA POV: experimentation velocity beats experimentation rigor for early-stage products. A startup running 50 quick-and-dirty experiments per quarter learns more than a startup running 5 statistically pristine experiments. Once product-market fit hardens and marginal growth gains shrink (typically post-Series C), velocity yields to rigor and the full statistical platform earns its keep. Buying a $250K/year platform pre-PMF is a status purchase.
What to Do
Pick the right platform by stage. (1) Pre-PMF startup (<$5M ARR, <50 experiments/year): use a free or low-cost tool such as GrowthBook (open source), Statsig (free tier), or PostHog Experiments. Optimize for velocity and ease of launching. (2) Growth-stage SaaS ($5M-$100M ARR, 50-500 experiments/year): graduate to Statsig, Eppo, or Optimizely. Invest in CUPED variance reduction and shared metric definitions. (3) Hyperscale ($100M+ ARR, 500+ experiments/year): consider building in-house (or augmenting Statsig/Eppo with custom analysis) to support sequential tests, switchback experiments, and metric trees. Sequence the rollout: shared metric definitions FIRST (so every experiment uses canonical 'activation' and 'retention' definitions), platform SECOND, training and review process THIRD. Skipping shared metrics is the dominant failure mode: every experiment ships a different definition of success and the platform becomes a dashboard cemetery.
Formula
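A sketch of the core sizing calculation every platform runs when you configure an experiment's minimum detectable effect and power: the required sample size per variant for a two-sided difference-in-means test (a standard textbook approximation, not any specific vendor's formula):

```latex
n \;\approx\; \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\sigma^{2}}{\delta^{2}}
```

Here alpha is the false positive rate (z ≈ 1.96 at alpha = 0.05), 1 - beta is the desired power (z ≈ 0.84 at 80%), sigma squared is the variance of the outcome metric, and delta is the minimum detectable effect. CUPED effectively replaces sigma squared with (1 - rho squared) times sigma squared, which is where its 30-50% sample-size reduction comes from.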
In Practice
Optimizely (founded 2010, IPO in 2021) built the modern category of commercial experimentation platforms. AB Tasty competes in the mid-market. Statsig (founded 2021 by ex-Facebook engineers, $100M+ ARR by 2024) and Eppo (founded 2020) are the high-velocity modern entrants emphasizing warehouse-native architecture and CUPED variance reduction. GrowthBook (open source) targets cost-conscious teams. Microsoft's Experimentation Platform team has published extensively on sequential testing and metric trees, drawing on tens of thousands of experiments per year on Bing and Microsoft 365. Booking.com famously runs 1,000+ experiments concurrently and has published case studies on shipping wrong winners due to peeking, multiple comparisons, and poor metric definition. The shared lesson across platforms: the platform is necessary, but experimentation culture and shared metrics are 80% of the outcome.
Pro Tips
- 01
CUPED (Controlled-experiment Using Pre-Experiment Data) variance reduction can cut required sample sizes by 30-50%, meaning faster experiments and the ability to detect smaller lifts. Modern platforms (Statsig, Eppo, Optimizely) implement it; check that yours does before paying a premium.
- 02
Sequential testing (mSPRT, group sequential designs) lets you peek at results without inflating false positive rates. Without it, peeking destroys experiment validity, and humans WILL peek. Pick a platform that supports sequential tests if you have non-statistical end users.
- 03
Shared metric definitions are the single most undervalued investment. Without them, every experiment ships a different version of 'activation' and analyses become incomparable across the org. Build canonical metric definitions in the platform (or in dbt feeding the platform) FIRST. Then add experiments.
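Tip 01's CUPED adjustment can be sketched in a few lines. This is a minimal illustration on synthetic data, not a vendor implementation; the simulated correlation and all names are assumptions:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """CUPED: remove the part of the experiment metric y that is predictable
    from a pre-experiment covariate x_pre (e.g. the same metric last month).

    theta = cov(x, y) / var(x); variance shrinks by roughly (1 - rho^2),
    where rho is the correlation between x_pre and y. The mean of y is
    unchanged, so treatment-effect estimates are unbiased.
    """
    theta = np.cov(x_pre, y)[0, 1] / np.var(x_pre)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=50_000)             # pre-experiment metric
y = 0.7 * x + rng.normal(size=50_000)   # experiment metric, correlated with x
y_adj = cuped_adjust(y, x)
print(y.var(), y_adj.var())  # adjusted variance is noticeably smaller
```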
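Tip 02's claim that peeking inflates false positives is easy to demonstrate by simulation. This sketch runs A/A tests (no true effect) where an analyst applies a fixed-horizon z-test every 100 observations and stops at the first 'significant' result; all parameters are illustrative:

```python
import math
import random

def peeking_false_positive_rate(n_sims=1000, n_max=1000, peek_every=100,
                                seed=7) -> float:
    """Simulate A/A tests with repeated peeking at a naive two-sided z-test
    (critical value 1.96, nominal alpha = 0.05). Returns the realized false
    positive rate, which peeking inflates well above 5%.
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        for n in range(1, n_max + 1):
            sum_a += rng.gauss(0, 1)  # both arms draw from the SAME distribution
            sum_b += rng.gauss(0, 1)
            if n % peek_every == 0:
                # z-statistic for difference in means, known unit variance
                z = (sum_a / n - sum_b / n) / math.sqrt(2.0 / n)
                if abs(z) > 1.96:
                    false_positives += 1  # a "winner" that is pure noise
                    break
    return false_positives / n_sims

rate = peeking_false_positive_rate()
print(f"false positive rate with 10 peeks: {rate:.1%}")  # well above the nominal 5%
```

Sequential methods like mSPRT are designed exactly so that this stopping rule keeps the error rate at the nominal level.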
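Tip 03's canonical metric registry can be as simple as a shared module that experiments import instead of redefining metrics inline. A minimal sketch; every name, SQL fragment, and owner here is hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    """One canonical metric, owned centrally so every experiment reads
    the same definition instead of inventing its own."""
    name: str
    sql: str        # definition against the warehouse (e.g. a dbt model)
    owner: str      # team accountable for the definition
    direction: str  # which way a "win" moves: "increase" or "decrease"

METRICS = {
    "activation": MetricDefinition(
        name="activation",
        sql="SELECT user_id FROM analytics.users WHERE completed_onboarding",
        owner="growth-data",
        direction="increase",
    ),
    "retention_d7": MetricDefinition(
        name="retention_d7",
        sql="SELECT user_id FROM analytics.activity WHERE active_on_day_7",
        owner="growth-data",
        direction="increase",
    ),
}

def metric(name: str) -> MetricDefinition:
    """Experiments look metrics up by name; a KeyError here is a feature,
    forcing new metrics through the registry rather than ad-hoc SQL."""
    return METRICS[name]
```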
Myth vs Reality
Myth
"More experiments always lead to better products"
Reality
Volume alone doesn't deliver outcomes. A team running 200 experiments per quarter with poor metric definitions, peeking violations, and confirmation bias ships more wrong winners than right ones. Microsoft's published research suggests roughly two-thirds of well-designed experiments at hyperscale companies fail to produce the predicted lift, meaning intuition alone (shipping without experimentation) is wrong about two-thirds of the time. Both volume AND rigor matter; either alone underperforms.
Myth
"AB testing platforms are commodities: pick the cheapest"
Reality
Platforms differ materially in CUPED implementation, sequential test support, warehouse-native architecture, metric layer integration, and statistical sophistication. The wrong choice at hyperscale costs millions in shipping wrong winners. The wrong choice at startup scale costs nothing because you're not running enough experiments to need rigor. Match platform sophistication to experimentation velocity, not to brand recognition.
Knowledge Check
Your $20M ARR Series B SaaS company runs ~80 experiments per year. You're choosing between a free tool (GrowthBook), a mid-tier platform (Statsig at ~$60K/year), and an enterprise platform (Optimizely at ~$200K/year). What's the right call?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Experimentation Platform Tier by Volume
Experimentation platform tier sweet spots by experiment volume
<50 exp/year (Free Tools)
GrowthBook, PostHog, Statsig free tier
50-500 exp/year (Mid-Tier)
Statsig, Eppo, AB Tasty $50K-$200K
500-5,000 exp/year (Enterprise)
Optimizely, Statsig Enterprise $200K-$1M+
5,000+ exp/year (Build In-House)
Booking, Microsoft, Meta, Netflix custom
Source: https://exp-platform.com/Documents/2017-08KDD-CUPED.pdf
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Statsig
2021-present
Statsig was founded in 2021 by ex-Facebook engineers who built Facebook's internal experimentation platform. The product combines feature flagging, experimentation, and product analytics in a warehouse-native architecture with CUPED, sequential testing, and integrated metric definitions. By 2024, Statsig reportedly crossed $100M ARR with customers including OpenAI, Notion, Atlassian, and Brex. The growth pattern reflects market demand for modern experimentation infrastructure: rigor of big-tech platforms with the UX of a startup tool.
Founded
2021
Reported ARR (2024)
$100M+
Notable Customers
OpenAI, Notion, Atlassian, Brex
Differentiator
Warehouse-native + CUPED + sequential
Modern experimentation platforms have raised the floor on statistical sophistication. Mid-market companies can now access big-tech-quality experimentation infrastructure for $60K-$200K/year.
Booking.com
2010-present
Booking.com runs one of the largest experimentation programs in the world: 1,000+ concurrent experiments at peak, on a custom-built platform covering their entire product surface. Booking has published extensively about both wins (more than $1B in cumulative revenue attributed to experimentation lifts over the years) and failure modes (multiple comparisons, peeking, novelty effects, weekly seasonality). Their public talks emphasize that the platform is necessary but the experimentation CULTURE is what drives outcomes.
Concurrent Experiments (peak)
1,000+
Cumulative Revenue Impact
>$1B over years
Platform
Custom-built in-house
Published Failure Modes
Peeking, multiple comparisons, novelty
At hyperscale, experimentation platform investment pays for itself many times over, but only when paired with the cultural discipline of accepting losing experiments and refusing to ship them.
Optimizely
2010-present
Optimizely defined the modern category of commercial AB testing platforms, going public in 2021 (later taken private). The platform powers experimentation for tens of thousands of customers across e-commerce, SaaS, and media. Optimizely's published case studies span from early conversion rate optimization (CRO) wins (10-30% lifts on landing pages) to mature programs running hundreds of experiments per quarter. The platform's evolution reflects market maturation: early growth came from CRO simplicity; current growth comes from full feature management + experimentation integration.
Founded
2010
Customer Base
Tens of thousands across industries
Public Listing Era
2021 (later private)
Use Cases
CRO, product experimentation, feature management
The commercial experimentation market is mature, with clear vendor tiers. Match the vendor to your stage and volume; don't pay enterprise prices for mid-market needs.
Decision scenario
The Experimentation Platform Purchase Decision
You're VP Growth at a Series B SaaS company at $15M ARR. Your team currently runs ~30 experiments per year using a basic feature flag tool with manual analysis in SQL. The CMO wants to buy Optimizely ($180K/year). The CEO is skeptical. The data team is overloaded.
Current Experiments per Year
~30
Current Tooling Cost
$0 (manual)
Optimizely Quote
$180K/year
Data Team Capacity
Already overloaded
Win Rate (estimated)
~10% (low confidence)
Decision 1
The CMO wants to sign the Optimizely contract this quarter to 'modernize the experimentation function'. The data team can't credibly support a 6x increase in experiments with current capacity. Win rate is low because metric definitions are inconsistent across experiments.
Sign the Optimizely contract: the platform will force discipline, and the CMO is right that experimentation should be a strategic capability.
Reject the enterprise platform. Spend Q1 fixing shared metric definitions in dbt, then Q2 deploying the Statsig free tier or GrowthBook ($0-$30K/year). Re-evaluate the need for an enterprise platform when experiment volume reaches 100+/year and the metric foundation is solid. (Optimal)