AI Context Window Strategy
Context window strategy is how you decide what goes into the model's input window — and, equally important, what does not. Modern frontier models offer 200K-1M token windows (Claude, Gemini), but that does not mean you should fill them. Cost scales linearly with input tokens, latency grows with them too, and accuracy follows a U-shape: models attend best to the start and end of context and lose recall in the middle (the 'lost in the middle' effect, Liu et al. 2023). The right strategy is rarely 'stuff everything in.' It is: retrieve the smallest sufficient context, structure it predictably, and use prompt caching to amortize the static portion. A 200K-token prompt that costs you $0.60 per call to ship a 90% answer is worse than a 15K-token RAG prompt that costs $0.05 to ship a 92% answer.
The Trap
The trap is treating long context as a substitute for retrieval. Teams discover Claude or Gemini can hold an entire 500-page contract and start dumping the whole document into every request, paying full input tokens each time and watching latency balloon to 30+ seconds. Worse, accuracy degrades: the model misses key clauses buried in the middle. The opposite trap is over-pruning context to save tokens, where retrieval misses critical information and the model hallucinates because it's missing the answer entirely. Token budgets should be set per request type, not globally.
What to Do
Define a tiered context strategy: (1) Static system context (cached aggressively — see Anthropic prompt caching, OpenAI prompt caching) — instructions, examples, schemas. (2) Retrieved context (sized to top-K relevance, usually 5-15 chunks of 500-1000 tokens each). (3) Conversation history (summarize after N turns). Set a per-request-type token budget and alert when prompts exceed it. Use prompt caching for the 60-80% of context that is identical across calls — Anthropic's prompt caching offers up to 90% savings on cached input tokens. Re-evaluate quarterly: caching support, model recall curves, and pricing all shift.
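The three tiers above can be sketched as a small assembly function with a per-request-type budget check. Everything here is illustrative — the request-type names, the budget limits, and the `count_tokens` callback are assumptions, not any provider's API:

```python
# Illustrative tiered context assembly with a per-request-type token budget.
BUDGETS = {"support_chat": 20_000, "doc_analysis": 60_000}  # assumed limits

def build_context(request_type, system_ctx, retrieved_chunks, history,
                  count_tokens):
    """Assemble context tiers and flag budget overruns.

    system_ctx: static, cache-marked portion (instructions, examples, schemas)
    retrieved_chunks: top-K passages, most relevant first
    history: conversation turns, already summarized past N turns
    count_tokens: tokenizer callback (provider-specific)
    """
    parts = [system_ctx] + list(retrieved_chunks) + list(history)
    total = sum(count_tokens(p) for p in parts)
    budget = BUDGETS[request_type]
    if total > budget:
        # Alert rather than silently truncate; over-pruning is its own trap.
        print(f"WARN: {request_type} prompt is {total} tokens, budget {budget}")
    return "\n\n".join(parts), total
```

The deliberate choice is to warn instead of truncating: dropping chunks automatically recreates the over-pruning trap described above.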
Formula
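The section's cost arithmetic reduces to one expression: per-call cost = (cached_in × p_in × d + dynamic_in × p_in + out × p_out) / 10⁶, where prices are dollars per million tokens and d is the cache-read multiplier (~0.10 for Anthropic cache reads, i.e. ~90% off). A minimal sketch, with the $3/M input and $15/M output prices as illustrative assumptions rather than any provider's actual rate card:

```python
def cost_per_call(cached_in, dynamic_in, out_tokens,
                  price_in=3.00, price_out=15.00, cache_mult=0.10):
    """Per-call cost in dollars.

    Prices are $/M tokens (assumed, illustrative); cache_mult is the
    cache-read multiplier (~0.10 for Anthropic cache reads, ~90% off).
    """
    return (cached_in * price_in * cache_mult
            + dynamic_in * price_in
            + out_tokens * price_out) / 1_000_000
```

For example, `cost_per_call(0, 100_000, 0)` is $0.30 — the long-context tax quoted later in this article — while moving 90K of those tokens behind a cache breakpoint, `cost_per_call(90_000, 10_000, 0)`, drops it to about $0.057.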
In Practice
Anthropic's Claude prompt caching launched with up to 90% cost reduction and ~85% latency reduction on cached input tokens, with cache lifetimes of 5 minutes (with refresh on use). Production teams using Claude with long static system prompts and frequent calls (coding agents, document analysis pipelines, customer support copilots) routinely report 70-85% input-token spend reductions just from caching the static portion of their context. The savings are biggest exactly where prompts are longest — which is also where the cost problem is worst without caching.
Pro Tips
- 01
If you're calling a frontier model with a system prompt longer than 1,000 tokens and you're not using prompt caching, you have homework. Cached input is roughly 10% of standard input price across major providers — a 90% line-item reduction on the static portion.
- 02
The 'lost in the middle' effect is real and persists in 2026 frontier models. Critical instructions should go at the start AND be repeated at the end, not buried in the middle.
- 03
Long context is not free even if it 'fits.' A 100K-token prompt at $3/M input tokens costs $0.30 just for input — multiply by request volume and the long-context tax shows up in your invoice fast.
Myth vs Reality
Myth
“Bigger context windows make RAG obsolete”
Reality
Long context complements RAG, doesn't replace it. Stuffing 1M tokens of corpus into every request is 100x more expensive than retrieving 10 relevant chunks. Long context is for the rare case where you genuinely need cross-document reasoning across a large set; RAG handles the 95% case where you only need a few relevant passages.
Myth
“Models attend equally to all parts of the context window”
Reality
Recall studies (needle-in-haystack tests, RULER benchmark) consistently show degraded recall in the middle 50-70% of long contexts, even on frontier models advertising '1M token' windows. Position your most important information first or last, never buried.
Knowledge Check
Your customer-support copilot has a 12K-token system prompt (instructions + examples + product knowledge), and handles 50K conversations/month. Each conversation averages 4 LLM turns. You're not using prompt caching. What's the highest-leverage optimization?
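A quick way to run the numbers on the static portion of this scenario, assuming $3/M input pricing and a ~90% cache-read discount (both assumptions, chosen to make the arithmetic concrete):

```python
calls = 50_000 * 4   # conversations/month × LLM turns per conversation
static = 12_000      # system prompt tokens resent on every call
price = 3.00         # $/M input tokens (assumed)

uncached = calls * static * price / 1e6  # static-portion spend, no caching
cached = uncached * 0.10                 # same spend with cache reads at ~10% of price
```

Under these assumptions the static portion alone is about $7,200/month uncached versus roughly $720/month cached — which is why enabling prompt caching is the highest-leverage first move.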
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets — not absolutes.
Prompt Caching Discount on Cached Input Tokens (frontier models, 2026)
Approximate discounts on the cached portion of input tokens; pricing changes — verify with provider docs.
Anthropic Claude (cache read)
~90% off standard input
OpenAI (cached input)
~50% off standard input
Google Gemini (context caching)
~75% off standard input + storage fee
Source: Anthropic prompt caching docs, OpenAI prompt caching docs, Google Vertex AI context caching docs
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Anthropic Claude Prompt Caching
2024-2026
Anthropic launched prompt caching offering up to 90% cost reduction and ~85% latency reduction on cached input tokens. Production teams running coding agents, document analysis, and customer support copilots — workloads with long, repeated system prompts — publicly reported 70-85% reductions in their input-token line item. The change required no model swap and no prompt rewrite, only marking the cached portion of the prompt with a cache control breakpoint. The wider lesson: large input prompts went from a structural cost problem to a structural cost advantage.
Cached Token Discount
Up to 90%
Latency Reduction on Cached Reads
~85%
Cache TTL
5 min (refreshes on use)
Typical Realized Savings (long-prompt apps)
70-85% of input spend
When the static portion of your prompt is large (5K+ tokens) and your call volume is high, prompt caching is the single highest-leverage configuration change available. The savings come from architecture, not from worse outputs.
Hypothetical: Internal Coding Assistant
2025
Hypothetical: A 400-engineer org rolled out an internal coding copilot built on a frontier API. The system prompt was 18K tokens (style guide, repo conventions, common patterns, 10 examples). At 80 calls/engineer/day, monthly input spend hit $94K. Engineering enabled prompt caching on the static 18K-token system prompt with a one-day change. Spend on input tokens dropped to ~$15K/month within a week — output costs unchanged.
Engineers
400
Calls/Engineer/Day
80
Static System Prompt
18K tokens
Input Spend (before)
$94K/month
Input Spend (after caching)
~$15K/month
Hypothetical: Internal AI tools with long fixed system prompts have outsized caching upside because the static portion dwarfs the dynamic portion of every call.
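The hypothetical's figures roughly reproduce if you assume ~$5/M input pricing — the case states no price, so this is purely illustrative. The remaining gap to the quoted $94K/$15K totals is the dynamic per-call portion, which caching does not touch:

```python
calls_per_month = 400 * 80 * 30  # engineers × calls/day × days
static_tokens = 18_000           # cached system prompt
price = 5.00                     # $/M input tokens (assumed for illustration)

static_before = calls_per_month * static_tokens * price / 1e6  # static-portion spend
static_after = static_before * 0.10                            # at a ~90% cache-read discount
```

That puts the static portion at roughly $86K/month before caching and under $9K after — consistent with the case's before/after totals once a few thousand dollars of dynamic input per month is added on top.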
Decision scenario
Long Context vs RAG Decision
You're building a 'chat with our 2,000-page product manual' feature. Your engineer proposes loading the entire manual (~600K tokens) into Claude's 1M context window for every query, citing simplicity. The CFO has separately asked for a unit-cost forecast. You're choosing the architecture this week.
Manual Size
~600K tokens
Expected Queries/Day
10,000
Input Token Price
$3/M tokens
Avg Output
500 tokens
Decision 1
If you stuff the full manual into every request, that's 600K input tokens × 10K queries/day. Even with caching, you're paying real money — and accuracy on buried passages may suffer.
Stuff the full 600K-token manual into every request — simplest implementation, no retrieval system to build
Build RAG: chunk the manual, embed once (batch), retrieve top-10 chunks per query (~8K tokens), use cached system prompt ✓ Optimal
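Running the scenario's numbers makes the choice concrete. The ~90% cache discount and the ~10K-token RAG prompt (retrieved chunks plus system context) are stated assumptions:

```python
queries_per_day = 10_000
price = 3.00  # $/M input tokens (from the scenario)

full = 600_000 * price / 1e6 * queries_per_day  # full manual in every request
full_cached = full * 0.10                       # same, assuming a ~90% cache-read discount
rag = 10_000 * price / 1e6 * queries_per_day    # ~10K tokens/query (8K chunks + system prompt)
```

Stuffing the manual costs about $18,000/day uncached and still ~$1,800/day with aggressive caching; the RAG design lands around $300/day — before even counting the recall penalty on passages buried mid-context.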
Beyond the concept
Turn AI Context Window Strategy into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.
Typical response time: 24h · No retainer required