RAG Architecture Design
RAG (Retrieval-Augmented Generation) is the architecture that grounds an LLM in your private documents by retrieving relevant chunks at query time and injecting them into the prompt. The pipeline has five components: ingestion (parsing + chunking), embedding (turning chunks into vectors), storage (a vector DB), retrieval (similarity search + reranking), and generation (the LLM call with retrieved context). RAG is how you get an LLM to answer 'What's our refund policy?' from your own help center without retraining the model. It is the single highest-ROI AI architecture pattern in enterprise, and the one most consistently botched.
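To make the five stages concrete, here is a self-contained toy sketch. Everything in it is a stand-in: word-count vectors instead of a real embedding model, a dict instead of a vector DB, and a formatted prompt instead of an actual LLM call.

```python
# Toy end-to-end RAG pipeline: the five stages in miniature. All names and
# data here are hypothetical; the point is the shape of the pipeline, not
# any particular vendor's API.
import math
import re
from collections import Counter

# 1. Ingestion: parsing + chunking (here the "documents" arrive pre-chunked).
CHUNKS = {
    "refund-policy#1": "Our refund policy: customers may request a refund within 30 days of purchase.",
    "refund-policy#2": "Digital goods are non-refundable once the download starts.",
    "shipping#1": "Standard shipping takes 5 to 7 business days.",
}

# 2. Embedding: a toy bag-of-words vector (a real system calls an embedding model).
def embed(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 3. Storage: an in-memory stand-in for a vector DB.
INDEX = {cid: embed(text) for cid, text in CHUNKS.items()}

# 4. Retrieval: similarity search over the index (reranking would slot in here).
def retrieve(query: str, top_k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(INDEX, key=lambda cid: cosine(q, INDEX[cid]), reverse=True)[:top_k]

# 5. Generation: inject the retrieved chunks into the prompt (LLM call stubbed out).
def answer(query: str) -> str:
    context = "\n".join(f"[{cid}] {CHUNKS[cid]}" for cid in retrieve(query))
    return (
        "PROMPT TO LLM:\nAnswer using only the context below and cite chunk IDs.\n"
        f"{context}\n\nQuestion: {query}"
    )

print(answer("What's our refund policy?"))
```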
The Trap
The trap is treating RAG as 'embed your docs and ship.' The naive pipeline retrieves the wrong chunks 30-50% of the time on real enterprise data because: (1) your documents are messy PDFs and Confluence exports, not clean Markdown, (2) chunking by 512 tokens cuts policies in half, (3) embeddings retrieve based on semantic similarity, not relevance: 'What's our refund policy?' often pulls the marketing page about refunds, not the actual policy, (4) one-shot retrieval misses multi-hop questions. The fix isn't a better embedding model; it's better chunking, hybrid search (keyword + semantic), and reranking.
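A small illustration of failure mode (2), using a hypothetical two-section policy document: fixed-size chunking cuts the refund policy mid-sentence, while section-aware chunking keeps it intact.

```python
# Illustrative only: compare naive fixed-size chunking against section-aware
# chunking on a tiny made-up document.
DOC = """# Refund Policy
Customers may request a refund within 30 days of purchase.
Refunds require proof of purchase and are issued to the original payment method.

# Shipping Policy
Standard shipping takes 5-7 business days.
"""

def fixed_size_chunks(text: str, size: int = 120) -> list[str]:
    """Naive chunking: split every `size` characters, regardless of meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def section_chunks(text: str) -> list[str]:
    """Semantic chunking: split on section headings so each policy stays whole."""
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("# ") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    sections.append("\n".join(current).strip())
    return sections

print(fixed_size_chunks(DOC)[0])   # cuts the refund policy mid-sentence
print(section_chunks(DOC)[0])      # keeps the whole refund section together
```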
What to Do
Build RAG in five layers and measure each one independently. (1) Chunking: chunk by semantic boundaries (sections, not token counts) and store metadata (doc title, section, date). (2) Hybrid retrieval: combine BM25 keyword search with vector search; either alone misses ~30%. (3) Reranking: use a cross-encoder to re-score the top 50 candidates down to the top 5. (4) Citation: force the LLM to cite the chunk ID it used; reject answers without citations. (5) Eval set: 100+ real questions with hand-labeled correct chunks. Measure retrieval recall@5 separately from answer quality.
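A minimal sketch of layers (2) and (5): reciprocal rank fusion to combine a keyword ranking with a vector ranking, plus recall@5 measured against hand-labeled relevant chunks. The two ranked lists are hardcoded stand-ins for whatever BM25 engine and vector DB you actually run.

```python
# Hybrid retrieval + eval sketch. The chunk IDs and rankings are hypothetical.
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(chunk) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of hand-labeled relevant chunks that appear in the top k."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# One eval question: BM25 and vector search each rank chunks differently.
bm25_ranking = ["policy#refunds", "faq#returns", "marketing#refund-promo"]
vector_ranking = ["marketing#refund-promo", "policy#refunds", "policy#exchanges"]

fused = rrf_fuse([bm25_ranking, vector_ranking])
print(fused)  # 'policy#refunds' rises to the top once both signals are combined
print(recall_at_k(fused, relevant={"policy#refunds", "policy#exchanges"}, k=5))
```

Run this per eval question and average the recall@5 values; that number is your retrieval score, tracked separately from answer quality.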
In Practice
Notion AI's 'Q&A' feature uses RAG over your workspace. Anthropic's documentation cites multiple production deployments where customers tuned chunking and reranking to lift answer accuracy from ~60% to >90% on internal-knowledge tasks. The pattern is consistent: the lift came from retrieval-layer fixes (better chunking, hybrid search, reranking), not from upgrading the LLM.
Pro Tips
- 01
Always log the retrieved chunks alongside the answer. When users complain about a wrong answer, 80% of the time the LLM was right given what was retrieved; the retrieval was wrong. You can't debug what you can't see.
- 02
Reranking is the cheapest lift in RAG. A small cross-encoder reranker on top 50 → top 5 typically adds 10-25 points to recall@5 for a few extra cents per query. Skip it and you're leaving accuracy on the table (a minimal sketch of this step follows these tips).
- 03
If your documents change frequently, build incremental re-embedding into the pipeline from day one. Backfilling 6 months of stale embeddings is the most expensive technical debt in RAG systems.
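A minimal sketch of the rerank step from tip 02, assuming the open-source sentence-transformers library and one of its public cross-encoder checkpoints; swap in whatever reranker you actually use.

```python
# Rerank sketch: score each (query, chunk) pair with a cross-encoder and keep
# the top 5. Requires `pip install sentence-transformers`; the checkpoint name
# is one public example, not a recommendation.
from sentence_transformers import CrossEncoder

# Load the model once at startup, not per query.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Re-score the retrieval candidates (e.g. the top 50) and keep top_k."""
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

Because the cross-encoder reads the query and chunk together, it scores actual relevance rather than generic similarity, which is why it recovers cases the embedding retriever ranks poorly.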
Myth vs Reality
Myth
“Bigger context windows kill RAG”
Reality
False. Long-context models complement RAG; they don't replace it. Even with 1M-token windows, you still need to retrieve relevant docs (you have 10M+ tokens of corporate content), and stuffing everything wastes money and degrades attention quality. The best architectures use RAG to pre-filter, then leverage long context for nuanced reasoning across 10-20 retrieved docs.
Myth
“Better embedding models solve retrieval problems”
Reality
Embedding upgrades typically add 2-5 points of recall. Hybrid search adds 10-15. Reranking adds 10-25. Better chunking adds 10-30. The embedding model is rarely the bottleneck once you're using a competent one (e.g., text-embedding-3 or Voyage).
Knowledge Check
Your RAG system has 65% answer accuracy on a 100-question eval set. The LLM almost always gives a correct answer when the right chunk is in the context window. What's the highest-leverage fix?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
RAG Retrieval Recall@5
Enterprise document Q&A, after hybrid retrieval + reranking
Excellent
> 90%
Good
80-90%
Average
65-80%
Poor
< 65%
Source: public benchmarks from Anthropic and vector DB vendors (Pinecone, Weaviate)
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Notion AI Q&A
2024
Notion shipped a workspace-scoped Q&A feature powered by RAG over user documents. Public engineering posts discuss the iterative path from naive embeddings to production quality: better chunking respecting page hierarchy, hybrid retrieval, and per-workspace permission filters. The result: a feature users actually trust rather than a demo that hallucinates.
Architecture
RAG with permission filters
Key Lift
Hierarchy-aware chunking + hybrid search
RAG quality in enterprise comes from respecting the structure of source documents and combining multiple retrieval signals, not from picking the trendiest LLM.
Hypothetical: Internal Knowledge Bot at a Large Bank
Composite scenario
A retail bank built an internal RAG bot for branch staff to query policy documents. v1 used naive chunking and dense embeddings only; recall@5 was 54%, and branch staff abandoned it. A 6-week rebuild added: (a) PDF parsing that preserved tables, (b) section-aware chunking, (c) BM25 + vector hybrid retrieval, (d) cross-encoder reranking, (e) forced citations. Recall@5 jumped to 87%, and daily active users went from 40 to 1,800.
v1 Recall@5
54%
v2 Recall@5
87%
DAU (v1)
40
DAU (v2)
1,800
RAG is a pipeline, not a model. The 33-point recall jump came entirely from non-LLM components: parsing, chunking, retrieval fusion, and reranking.
Decision scenario
The RAG Bake-Off
Your team has a working RAG MVP at 65% recall@5 on your eval set. The CEO wants to ship in 4 weeks. You have $40K of cloud + API budget for the quarter. The product team wants to add features; the platform team wants to fix retrieval.
Current Recall@5
65%
Eval Set Size
120 questions
Time to Ship
4 weeks
Budget Remaining
$40,000
Decision 1
You can either ship at 65% accuracy with prominent 'AI may be wrong' disclaimers, OR delay 2 weeks to fix the retrieval layer first.
Ship at 65% with disclaimers: users will tell us what's broken in production, and we'll learn faster from real traffic.
Delay 2 weeks. Spend $5K on a reranker, $3K on better chunking infrastructure, and rebuild the eval set to 250 questions. Then ship. (Optimal)
Beyond the concept
Turn RAG Architecture Design into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.
Typical response time: 24h · No retainer required