AI Strategy · Intermediate · 7 min read

Foundation Model Selection

Foundation model selection is the disciplined process of choosing which base LLM (or LLMs) to power your AI features given task requirements, latency targets, cost ceiling, deployment constraints (cloud / on-prem / regional), data sensitivity, and vendor risk tolerance. The right answer is rarely 'the best model'; it's a portfolio: a frontier model for hard or open-ended tasks, a mid-tier model for the majority of throughput, and a small/cheap model (or open-weight model) for high-volume bounded tasks. Single-model strategies are brittle to vendor pricing changes, capability shifts, and outages.
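
One way to make the portfolio concrete is to write it down as an explicit routing policy. A minimal Python sketch; the tier names, model names, and task labels are illustrative placeholders, not recommendations:

```python
# Minimal sketch of a model portfolio expressed as a routing policy.
# Tier names, model names, and task labels are illustrative placeholders.
MODEL_PORTFOLIO = {
    "small":    {"model": "open-weight-small", "tasks": {"classification", "structured extraction"}},
    "mid_tier": {"model": "mid-tier-closed",   "tasks": {"summarization", "drafting"}},
    "frontier": {"model": "frontier-closed",   "tasks": {"agentic workflows", "open-ended analysis"}},
}

def pick_tier(task: str) -> str:
    """Return the cheapest tier whose scope covers the task; default to frontier."""
    for tier in ("small", "mid_tier", "frontier"):  # ordered cheapest to most expensive
        if task in MODEL_PORTFOLIO[tier]["tasks"]:
            return tier
    return "frontier"
```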

Also known as: LLM Selection, Base Model Selection, Model Vendor Selection, Frontier Model Selection, Model Picker.

The Trap

Two common traps. First: defaulting to whichever vendor your CTO has a relationship with, regardless of task fit. The result is using a $15/M-input frontier model for tasks a $0.50/M-input model could handle, and burning $400K/year unnecessarily. Second: chasing every benchmark leaderboard. The model that wins on MMLU may underperform on your specific use case (long-context summarization, structured extraction, function calling). Benchmarks are a starting point; your evaluation set is the ground truth.

What to Do

Build a 50-200 example evaluation set covering your real use cases. Run candidate models against it, scoring on: accuracy, latency (p50/p95), cost per task, and qualitative fit (tone, structure, refusal behavior). Then add non-functional criteria: data residency, on-prem/private deployment options, content filter behavior, vendor financial stability, and contract terms (data usage, indemnification, deprecation policy). Make a ranked recommendation per use case tier. Re-evaluate every 6 months; the model landscape moves fast, and yesterday's best can become uneconomic vs. a newer entrant.
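
A minimal harness for that comparison might look like the sketch below. It assumes a placeholder call_model client that returns the answer plus input/output token counts; the per-million-token prices are illustrative, not quotes from any vendor.

```python
import statistics
import time

# Illustrative (input, output) prices per million tokens; replace with real vendor pricing.
PRICE_PER_M_TOKENS = {"model-a": (15.0, 60.0), "model-b": (0.50, 1.50)}

def evaluate(model: str, eval_set: list[dict], call_model) -> dict:
    """Score one candidate model on accuracy, latency (p50/p95), and cost per task."""
    latencies, correct, cost = [], 0, 0.0
    for example in eval_set:  # 50-200 examples drawn from real use cases
        start = time.perf_counter()
        answer, in_tokens, out_tokens = call_model(model, example["prompt"])
        latencies.append(time.perf_counter() - start)
        correct += int(answer.strip() == example["expected"])
        in_price, out_price = PRICE_PER_M_TOKENS[model]
        cost += in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price
    latencies.sort()
    return {
        "accuracy": correct / len(eval_set),
        "latency_p50": statistics.median(latencies),
        "latency_p95": latencies[int(0.95 * (len(latencies) - 1))],
        "cost_per_task": cost / len(eval_set),
    }
```

Exact-match scoring is a simplification; qualitative fit (tone, structure, refusal behavior) still needs human or rubric-based review.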

Formula

Right Model per Task = ARGMIN(Cost) WHERE Quality ≥ Threshold AND Latency ≤ Budget AND Deployment Constraints Met
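
Read literally, the formula is a constrained argmin. A minimal sketch, assuming each candidate record carries the metrics from your eval run plus a deployment tag:

```python
# Cheapest model that clears the quality floor, the latency budget,
# and the deployment constraints for this use case.
def pick_model(candidates: list[dict], quality_floor: float,
               latency_budget_s: float, allowed_deployments: set[str]) -> dict | None:
    feasible = [
        c for c in candidates
        if c["accuracy"] >= quality_floor
        and c["latency_p95"] <= latency_budget_s
        and c["deployment"] in allowed_deployments
    ]
    return min(feasible, key=lambda c: c["cost_per_task"]) if feasible else None
```

If nothing is feasible, the answer is to relax a constraint deliberately, not to quietly pick the most capable model.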

In Practice

Frontier model providers, such as OpenAI (GPT-5.x family), Anthropic (Claude family), and Google (Gemini family), compete on capability, cost, and latency, with rankings shuffling per release. Open-weight models (Meta Llama, Mistral, Alibaba Qwen, DeepSeek) provide lower cost and on-prem options for many tasks. Hugging Face's Open LLM Leaderboard, LMSys Chatbot Arena, and HELM are commonly cited evaluation references. Most production-mature enterprises run a mix: Anthropic for high-stakes drafting, an open-weight model for high-volume classification, and a frontier model for agentic workflows.

Pro Tips

  • 01

    The cheapest model that hits your quality threshold is the right model โ€” not the most capable one. Most teams over-spec by 5-10x because no one runs the cost/quality tradeoff explicitly. Run it.

  • 02

    Always have a 'fallback' model in your gateway: when the primary fails or rate-limits, traffic seamlessly routes to a secondary. This single architectural decision converts vendor outages from incidents to non-events. Most production AI gateways now ship with this built-in (a minimal sketch of the pattern follows this list).

  • 03

    Negotiate volume contracts for your top one or two models, but keep at least one alternative integrated even if you don't route traffic to it. Vendor lock-in risk in AI is high: models deprecate, prices change, capabilities shift. Optionality is cheap to maintain and expensive to retrofit under pressure.
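
The fallback pattern from tip 02, as a minimal sketch; call_primary and call_fallback stand in for real provider clients, and RateLimitError is a placeholder for whatever exception your SDK raises.

```python
# Minimal fallback-routing sketch: try the primary model, reroute on failure.
class RateLimitError(Exception):
    """Placeholder for the rate-limit exception your provider SDK raises."""

def generate(prompt: str, call_primary, call_fallback, timeout_s: float = 10.0) -> str:
    try:
        return call_primary(prompt, timeout=timeout_s)
    except (RateLimitError, TimeoutError, ConnectionError):
        # Primary is down, slow, or rate-limited: route to the secondary model.
        return call_fallback(prompt, timeout=timeout_s)
```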

Myth vs Reality

Myth

“The model with the highest benchmark score is the best choice”

Reality

Benchmarks measure general capability. Your use case is specific. A model that scores 92% on MMLU may underperform an 87%-MMLU model on your customer service classification task. Always evaluate on your data; benchmarks are a screen, not a verdict.

Myth

“We need to fine-tune a base model to get good results”

Reality

For 80% of use cases, a strong base model with good prompting and retrieval beats a fine-tuned smaller model. Fine-tuning is justified when you've exhausted prompting/RAG and the cost/latency of a smaller fine-tuned model is materially better than a prompted base model. Start with prompting, escalate as needed.

Try it

Run the numbers.

Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.

🧪

Knowledge Check

A company processes 50M customer support classifications per month. Each classification requires the model to assign one of 12 categories. Which foundation model strategy is most appropriate?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets, not absolutes.

Foundation Model Strategy Maturity

Enterprises with AI features in production

  • Mature: Multi-model portfolio, eval-driven, gateway with fallback, 6-month review cadence
  • Functional: 2 models in production, periodic re-eval
  • Single-Vendor: One model for all use cases
  • Default-Choice: Whatever the engineer picked first, no formal evaluation

Source: Vendor evaluation patterns from Andreessen Horowitz Enterprise AI surveys + practitioner reports

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.

🤗

Hugging Face Open LLM Leaderboard

2023-present

success

Hugging Face hosts an evolving leaderboard of open-weight LLMs evaluated on standardized benchmarks (MMLU, ARC, HellaSwag, TruthfulQA, GSM8K, Winogrande). The leaderboard demonstrates how rapidly the open-weight model landscape evolves: top models change every few months, and capability gaps with closed frontier models have narrowed significantly. Enterprises that built selection processes around the leaderboard plus their own evaluation sets have been able to switch models as capability and cost shift, capturing material savings.

Models Evaluated: Thousands
Eval Suite: MMLU, ARC, HellaSwag, TruthfulQA, GSM8K, Winogrande
Update Frequency: Continuous

Public leaderboards plus your own evaluation set is the right combination. Public leaderboards screen for general capability; your eval set verifies fit. Either alone is insufficient.

Source ↗

💸

Hypothetical: Single-Vendor Lock-In Bill Shock

Composite scenario

pivot

A B2B SaaS standardized on a single frontier model in 2023 for all AI features. By Q4 2024, monthly AI costs had grown to $480K. A late-2024 evaluation found that 70% of their workload (categorization, summarization of short content, structured extraction) ran equally well on a model 20x cheaper. Migration to a multi-model gateway took 14 weeks of engineering. New monthly AI cost: $145K, a 70% reduction. The lesson: single-vendor commitments at the start of an AI program become structural cost problems within 18 months as the model landscape evolves.

Pre-Migration Monthly Cost: $480K
Post-Migration Monthly Cost: $145K
Annual Savings: ~$4M
Migration Investment: ~14 weeks engineering

Multi-model architecture is cheap insurance and strategic optionality. Single-vendor strategies that made sense in early 2023 are usually wrong by late 2024.

Decision scenario

Picking Models for a New AI Product

You are CTO of a B2B legal-tech startup launching an AI-powered contract review feature. Three model families are on the table: a frontier closed model, a mid-tier closed model, and an open-weight model you'd self-host. Workload projection: 2M document analyses per month, average 5K input tokens and 1K output tokens. The product targets mid-market law firms; data sensitivity is high (privileged client material).

Projected Monthly Volume: 2M analyses
Avg Tokens per Analysis: 5K in / 1K out
Customer Type: Mid-market law firms
Data Sensitivity: High (privileged)
Time to Launch: 12 weeks
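
The projection pins down the raw token volume before any vendor choice; a quick back-of-envelope sketch (per-token prices are left as inputs because they vary by vendor and are not specified here):

```python
# Monthly token volume implied by the stated workload.
analyses_per_month = 2_000_000
input_tokens = analyses_per_month * 5_000    # 10B input tokens per month
output_tokens = analyses_per_month * 1_000   # 2B output tokens per month

def monthly_cost(price_in_per_m: float, price_out_per_m: float) -> float:
    """Monthly spend for a given (input, output) price per million tokens."""
    return input_tokens / 1e6 * price_in_per_m + output_tokens / 1e6 * price_out_per_m

# At this volume, every $1 per million input tokens adds $10K/month,
# and every $1 per million output tokens adds $2K/month.
```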

01

Decision 1

First decision: lead model choice. The frontier model has the best quality on a 100-document evaluation (94% accuracy on key extraction). The mid-tier model scores 88%. The open-weight self-hosted model scores 82%. The frontier model would cost ~$160K/month at projected volume; mid-tier ~$24K/month; self-hosted ~$8K/month plus infrastructure.

Lead with the frontier model: quality is paramount in legal-tech and the cost is justified.
Quality is excellent at launch but margins are catastrophic. Gross margin on the AI feature is -15% at the price point that wins customers. By month 8 the company is burning through cash 3x faster than plan. A late re-architecture is forced under pressure with degraded customer experience as you migrate live.
Monthly AI Cost at Scale: $160K · Gross Margin: -15% · Runway Impact: Shortened by 18 months
Lead with the mid-tier model for the default workload, route to frontier only for low-confidence cases (~5% of volume), and keep self-hosted as a future option for volume-dominant tasks.
Blended monthly cost is approximately $33K (mid-tier $24K + frontier on 5% ≈ $9K). Quality is 92% (mid-tier baseline plus frontier escalation on hard cases). Margin is healthy at 62%. The architecture supports adding self-hosted as a third tier in 6 months when volume justifies it. The customer experience is excellent and economics are sustainable.
Monthly AI Cost at Scale: ~$33K · Quality: 92% · Gross Margin: 62%
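
The blended-cost arithmetic behind the second option, reconstructed from the scenario's own figures:

```python
# Mid-tier handles the default workload; ~5% of cases escalate to the frontier model.
mid_tier_monthly = 24_000     # mid-tier cost for the default workload
frontier_monthly = 160_000    # frontier cost if it handled all volume
escalation_share = 0.05       # fraction of cases routed to the frontier model

blended = mid_tier_monthly + escalation_share * frontier_monthly
# 24_000 + 0.05 * 160_000 = 32_000, i.e. roughly the ~$33K cited above
# (the scenario rounds the escalation slice up to about $9K).
```
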
02

Decision 2

Second decision: data sensitivity. The frontier vendor offers a no-train-on-data zero-retention enterprise tier with BAA-equivalent privacy controls; the mid-tier vendor's standard enterprise contract excludes training but does not guarantee zero retention; the open-weight self-hosted option keeps everything in your VPC.

Use the standard mid-tier contract since training is excluded; retention is unlikely to be a real issue.
Six months in, an enterprise prospect demands evidence of zero retention as a procurement requirement. You can't provide it. Deal stalls. Two more enterprise prospects raise the same concern. You rush a contract amendment with the vendor (possible, but it takes 8 weeks of legal back-and-forth). Three deals slip a quarter. CFO is unhappy.
Enterprise Deals Slipped: 3 · Contract Amendment Time: 8 weeks
Negotiate zero-retention terms with both frontier and mid-tier vendors before launch, and build the architecture so workloads can be routed to self-hosted later if a customer requires it.
Procurement at every enterprise customer accepts the contract terms. Three customers later request VPC deployment for compliance reasons; the architecture supports it within 6 weeks per customer. Sales velocity is uninterrupted. The early investment in contract terms and architectural optionality saves an order of magnitude more than it costs.
Procurement Friction: Minimal · VPC Deployment Time per Customer: 6 weeks

Related concepts

Keep connecting.

The concepts that orbit this one; each one sharpens the others.

Beyond the concept

Turn Foundation Model Selection into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.

Typical response time: 24h · No retainer required
