AI Strategy · Intermediate · 7 min read

Fine-Tuning vs RAG

Fine-tuning and RAG solve different problems. Fine-tuning teaches the model new STYLE, FORMAT, or specialized BEHAVIOR — how to respond. RAG provides new KNOWLEDGE — what facts to use. The decision rule that survives contact with reality: 'Knowledge → RAG. Behavior → Fine-tune. Both → both.' Fine-tuning is expensive (data labeling + training cost + ops cost of a custom model + obsolescence when the base model upgrades). RAG is cheap to start, scales linearly with corpus size, and updates instantly when documents change. 90% of enterprise AI use cases need RAG, not fine-tuning.

Also known as: Custom Model vs Retrieval · Model Adaptation Strategy · When to Fine-Tune · RAG or Fine-Tune

The Trap

The trap is fine-tuning to fix a hallucination problem. Teams see the model getting facts wrong, assume 'we need to teach it our data,' fine-tune on internal docs, and end up with a model that hallucinates the same facts more confidently. Fine-tuning encodes patterns, not retrieval. The model still doesn't 'know' your facts; it just learned that your docs LOOK like the right answer style. Worse, when you update a policy, the fine-tuned model still gives the old answer until you retrain. RAG fixes this in real time.
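A toy illustration of why retrieval updates in real time while a fine-tune does not: the model's context comes from whatever the corpus says at query time. The corpus content and keyword scoring below are made-up stand-ins for a real document store and embedding search, assuming no particular vector database.

```python
# Toy retrieval loop: editing a document changes the very next answer,
# with no retraining. All policy text here is invented for illustration.
import re

policies = {
    "refund_policy": "Refund window: 30 days from purchase.",
    "shipping_policy": "Standard shipping takes 5 business days.",
}

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str) -> str:
    # Stand-in for embedding search: pick the doc sharing the most words.
    return max(policies.values(), key=lambda doc: len(tokens(query) & tokens(doc)))

def build_prompt(query: str) -> str:
    return f"Answer using only this context:\n{retrieve(query)}\n\nQ: {query}"

print(build_prompt("What is the refund window?"))

# The policy changes; the next query sees the new text immediately.
policies["refund_policy"] = "Refund window: 14 days from purchase."
print(build_prompt("What is the refund window?"))
```

A fine-tuned model has no equivalent of that dictionary update: the old answer is baked into the weights until the next training run.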

What to Do

Use a 3-question decision framework: (1) Does the answer require facts that change over time or differ per customer? → RAG. (2) Does the model need a consistent voice, format, or specialized output structure (legal contracts, code style, JSON schema adherence)? → Fine-tune. (3) Does the model need to follow a complex multi-step procedure unique to your business? → Try few-shot prompting first; fine-tune only if you have >5,000 high-quality examples and the prompt is hitting context limits. Always exhaust prompting + RAG before fine-tuning.
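To make the framework concrete, here is a minimal sketch in code. The function name, argument names, and return shape are illustrative; only the >5,000-example threshold comes from the framework itself.

```python
# A sketch of the 3-question decision framework above.

def choose_adaptation(facts_change_over_time: bool,
                      needs_consistent_format: bool,
                      needs_custom_procedure: bool,
                      labeled_examples: int = 0,
                      prompt_hits_context_limit: bool = False) -> list[str]:
    """Return the recommended customization stack, cheapest option first."""
    plan = ["prompting"]  # always exhaust prompting (and RAG) before fine-tuning
    if facts_change_over_time:
        plan.append("RAG")  # Q1: changing or per-customer facts
    if needs_consistent_format:
        plan.append("fine-tune")  # Q2: stable voice/format/schema behavior
    if needs_custom_procedure:
        plan.append("few-shot prompting")  # Q3: try examples in the prompt first
        if labeled_examples > 5_000 and prompt_hits_context_limit:
            if "fine-tune" not in plan:
                plan.append("fine-tune")
    return plan

# Example: a support bot over policies that change monthly -> prompting + RAG
print(choose_adaptation(facts_change_over_time=True,
                        needs_consistent_format=False,
                        needs_custom_procedure=False))
```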

Formula

Total Cost of Customization = (RAG Cost per Query × Lifetime Queries) + Fine-Tune Training Cost + (Monthly Maintenance Cost × Months)

To compare approaches on equal footing, divide the total by lifetime queries to get an amortized cost per query.
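A worked instance of the formula. Every dollar figure below is a hypothetical placeholder for illustration, not a benchmark from this article.

```python
# Worked example of the cost formula; all inputs are hypothetical.
rag_cost_per_query = 0.002        # $: retrieval infra + extra context tokens
lifetime_queries = 1_000_000      # queries served over the system's lifetime
fine_tune_training_cost = 25_000  # $: one-off labeling + training run
monthly_maintenance = 3_000       # $: evals, re-training, custom-model ops
months = 12

total_cost = (rag_cost_per_query * lifetime_queries   # $2,000
              + fine_tune_training_cost               # $25,000
              + monthly_maintenance * months)         # $36,000

print(f"Total: ${total_cost:,.0f}, per query: ${total_cost / lifetime_queries:.4f}")
# Total: $63,000, per query: $0.0630
```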

In Practice

OpenAI's customer case studies show GitHub Copilot using a combination of fine-tuning (for code-completion behavior) and retrieval (for repo-specific context). Anthropic and OpenAI both publish guidance that recommends prompting → RAG → fine-tuning, in that order. Klarna publicly described its AI assistant as relying primarily on RAG for product/policy knowledge plus prompting strategies, not bespoke fine-tuned models per task; the operational cost of maintaining many fine-tunes was prohibitive.

Pro Tips

1. Fine-tuning for 'voice' is one of the few clear wins: tone of voice, formatting conventions, JSON schema adherence. These are stable behaviors that reward training. Avoid fine-tuning for 'knowledge of our docs' — RAG does this better and updates instantly.

2. Frontier model upgrades happen every 6-12 months. A fine-tuned model on yesterday's base model is depreciating tech debt the day it ships. Plan for re-fine-tuning costs, OR keep your customization in prompts/RAG so model upgrades are trivial.

3. If you genuinely need fine-tuning, parameter-efficient methods (LoRA, QLoRA) get you 90% of the benefit at 10% of the cost; a minimal setup is sketched after this list. Avoid full fine-tunes unless you have a research team and a clear ROI model.
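A minimal parameter-efficient setup, sketched with Hugging Face's transformers and peft libraries. The base model name and every hyperparameter are illustrative defaults, not recommendations from this article.

```python
# Minimal LoRA sketch (Hugging Face transformers + peft).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

config = LoraConfig(
    r=8,                                  # adapter rank: the cheap-fine-tune lever
    lora_alpha=16,                        # scaling applied to adapter updates
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)      # base weights stay frozen
model.print_trainable_parameters()        # typically well under 1% trainable
```

Because only the small adapter trains, the labeling and compute bill shrinks accordingly, and the adapter can be retrained against a new base model far more cheaply than a full fine-tune.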

Myth vs Reality

Myth: Fine-tuning is the 'real' AI work; RAG is a hack.

Reality: Backwards. RAG is the production default; fine-tuning is the specialized intervention. The most sophisticated AI products in the market (Notion AI, Intercom Fin, Klarna's assistant, GitHub Copilot Chat) lean heavily on RAG and prompt engineering with selective fine-tuning. The 'just fine-tune it' instinct is usually engineering theater that ships nothing.

Myth: Fine-tuning makes models smarter on your domain.

Reality: Fine-tuning makes models BETTER AT IMITATING your training data's patterns. It can degrade general reasoning if done poorly. A fine-tuned model that's narrow and inflexible is a common, expensive failure mode — especially with small training sets.


Knowledge Check

A pharmaceutical company wants their AI assistant to answer doctors' questions about drug interactions using their proprietary clinical database. The database updates monthly. Which approach should they pick?

Industry benchmarks


Calibrate against real-world tiers. Use these ranges as targets — not absolutes.

When Fine-Tuning Pays Off: decision heuristic for enterprise AI teams

  • Strong Fit: stable narrow behavior + >10K labeled examples + >1M queries/year
  • Possible Fit: style/format consistency + >5K examples
  • Probably Don't Fine-Tune: knowledge problem OR <1K examples OR rapidly changing data
  • Avoid: trying to teach 'facts' or 'our docs' via fine-tuning

Source: Anthropic & OpenAI deployment guidance + practitioner consensus

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.


Klarna AI Assistant (2024) · Success

Klarna publicly disclosed that their AI assistant handles work equivalent to ~700 full-time agents. The architecture leans heavily on prompting + retrieval over policy/product knowledge with frontier models, rather than maintaining a portfolio of fine-tuned models per task. The operational simplicity of upgrading the underlying model when frontier improvements ship was cited as a key advantage.

  • Equivalent FTE workload handled: ~700 agents
  • Customer resolution time: cut significantly
  • Architecture default: prompting + RAG, selective fine-tuning

At scale, the operational cost of maintaining many fine-tuned models can outweigh marginal accuracy gains versus prompting + RAG with frontier models.


Hypothetical: Insurance Claims AI Refactor (composite scenario) · Pivot

A mid-sized insurer fine-tuned a model on 12,000 historical claims to classify claim type and recommend next actions. After 6 months: maintenance cost ballooned, the underlying base model was deprecated, and accuracy degraded as new claim types emerged. They migrated to RAG over a structured claims-rule database with few-shot prompting. Accuracy improved 8 points and maintenance dropped to near-zero.

  • Pre-migration accuracy: 84%
  • Post-migration accuracy: 92%
  • Annual maintenance: $220K → $35K
  • Time to add new claim type: 6 weeks → 1 day

The fine-tuned model was technical debt with a model-version expiration date. RAG decoupled customization from the base model lifecycle.
