AI Model Versioning
AI model versioning is the discipline of explicitly identifying, pinning, and tracking which exact model version serves each production request. Three things must be versioned: (1) the underlying model, which for vendor APIs means the dated snapshot (e.g., gpt-4o-2024-11-20, claude-3-5-sonnet-20241022) and for self-hosted models means a content hash of the weights; (2) the prompt, versioned in git with a hash; (3) the surrounding orchestration code: the agent loop, tool definitions, and post-processing. The combination of all three uniquely identifies a 'serving version.' Without versioning, every production prediction is unreproducible, every quality regression is undebuggable, and every vendor model update silently changes your application's behavior.
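The 'serving version' triple can be made concrete in a few lines. A minimal sketch in Python; the class and helper names, the prompt text, and the commit hash are all illustrative, not from any particular SDK:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class ServingVersion:
    """Uniquely identifies what produced a prediction."""
    model_version: str   # dated vendor snapshot or weights hash
    prompt_version: str  # content hash of the exact prompt text
    code_commit: str     # git commit of the orchestration code

def prompt_hash(prompt_text: str) -> str:
    """Content-address the prompt so any edit yields a new version."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]

# Example: pin the dated snapshot, hash the prompt, record the commit.
version = ServingVersion(
    model_version="gpt-4o-2024-11-20",  # never the bare alias "gpt-4o"
    prompt_version=prompt_hash("Summarize the ticket in two sentences."),
    code_commit="a1b2c3d",              # illustrative commit hash
)
```

Attaching this triple to every logged prediction is what makes a historical output reproducible.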
The Trap
The trap is using model aliases like 'gpt-4o', 'claude-3-5-sonnet-latest', or 'gemini-pro'. These resolve to whatever version the vendor currently considers default, which changes without warning. Your eval suite that scored 89% last week may score 76% today because the alias now points to a new snapshot. The second trap is versioning only the model and not the prompt. Prompts evolve faster than models; if you can't tell which prompt version produced a given output, you can't debug. The third: deleting old model versions when a new one ships, leaving you no rollback path. Always keep the previous 1-2 versions deployable and warm.
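A cheap mechanical defense against the first trap is a lint that rejects un-dated model IDs in CI or code review. A sketch; the function name and regex are mine, it assumes vendor snapshots end in a YYYY-MM-DD, YYYYMMDD, or MMDD suffix, and the heuristic can false-positive on non-date trailing digits:

```python
import re

# Dated snapshots end in a date suffix: "-2024-11-20", "-20241022",
# or the older 4-digit "-0613" style. Bare aliases carry no date.
_DATED = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8}|\d{4})$")

def is_pinned(model_id: str) -> bool:
    """True if the model ID carries an explicit date pin."""
    return bool(_DATED.search(model_id))

assert is_pinned("gpt-4o-2024-11-20")
assert is_pinned("claude-3-5-sonnet-20241022")
assert not is_pinned("gpt-4o")                    # bare alias
assert not is_pinned("claude-3-5-sonnet-latest")  # floating tag
```

Running this check over every model string in config at CI time turns a silent production risk into a failed build.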
What to Do
Mandate explicit version pinning across the full stack: (1) Pin every vendor model to the dated snapshot in code or config (e.g., 'gpt-4o-2024-11-20', not 'gpt-4o'). (2) Hash and version every prompt in git; tag the running version on every prediction log. (3) Maintain a 'model registry' with every active model, its version, deployment date, eval score at deploy, and rollback target. (4) When a new vendor version ships, run your full eval suite against both versions BEFORE switching. Switch only on green. (5) Keep N-1 deployable for at least 30 days. (6) Stamp every prediction in production logs with: model_version, prompt_version, code_commit. This is your debug receipt.
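Steps (4) and (5) boil down to a simple promotion gate: switch only if the candidate's eval score does not regress beyond a tolerance against the currently pinned version. A minimal sketch; the scores and the regression tolerance are illustrative:

```python
def can_promote(current_score: float, candidate_score: float,
                max_regression: float = 0.01) -> bool:
    """Gate a model-version switch on eval results.

    Promote only if the candidate is no more than `max_regression`
    worse than the currently pinned version ("switch only on green").
    """
    return candidate_score >= current_score - max_regression

# Illustrative scores from running the same eval suite on both pins.
assert can_promote(current_score=0.89, candidate_score=0.91)      # improvement
assert not can_promote(current_score=0.89, candidate_score=0.76)  # regression
```

The tolerance exists because eval suites are noisy; set it from the observed run-to-run variance of your own suite, not from a default.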
In Practice
OpenAI publishes dated snapshot model IDs (e.g., gpt-4o-2024-11-20, gpt-4o-2024-08-06, gpt-4-0613) precisely because version pinning is required for production reproducibility. Their model deprecation policy gives roughly 6-12 months of notice before a snapshot is retired, but if you're using the alias 'gpt-4o' instead of a dated version, you get zero notice for behavior changes. Anthropic uses the same pattern (claude-3-5-sonnet-20241022, claude-3-opus-20240229). MLflow Model Registry, SageMaker Model Registry, and Azure ML Model Registry all enforce versioning for self-hosted models. Hugging Face supports pinning via commit hashes or revision tags. The pattern is universal: every serious AI platform provides explicit versioning because reproducibility requires it.
Pro Tips
- 01
Build a 'model promotion' workflow: any change to the model version pin requires a PR with eval results attached. Reviewers check that the eval delta is acceptable, not just that the code looks right. This catches most would-be regressions before they ship.
- 02
Stamp predictions with the exact serving version in your logs. When a customer reports a bad response, you should be able to query: 'show me the model version, prompt version, and tool config that produced this prediction.' Without that, every quality investigation becomes an archeology project.
- 03
When a vendor announces a deprecation, do NOT wait until the deadline. Eval the new version against your suite within the first 2 weeks of availability. Migrate at YOUR pace with confidence. Teams that scramble at the deprecation deadline ship regressions they would have caught with leisurely eval.
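The version stamps from the tips above are only useful if they are queryable per request. A toy sketch with an in-memory list standing in for a real log store or warehouse; all field names and record values are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PredictionRecord:
    """One production prediction plus its 'debug receipt'."""
    request_id: str
    model_version: str
    prompt_version: str
    code_commit: str
    output: str

# Stand-in for a production log store.
log: list[PredictionRecord] = [
    PredictionRecord("req-1", "gpt-4o-2024-08-06", "p-3f2a", "9c1d0e2",
                     "response text"),
    PredictionRecord("req-2", "gpt-4o-2024-11-20", "p-7b91", "a1b2c3d",
                     "response text"),
]

def lookup(request_id: str) -> Optional[PredictionRecord]:
    """Answer: what exactly produced this prediction?"""
    return next((r for r in log if r.request_id == request_id), None)

record = lookup("req-2")
```

When a customer escalates a specific response, one lookup returns the exact model, prompt, and code that produced it, turning the investigation into a query instead of an archeology project.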
Myth vs Reality
Myth
โVendor models are stable enough that pinning is overkillโ
Reality
Even within a single dated snapshot, behavior can shift via safety updates, infra changes, or rate-limit-driven routing. The dated snapshot is the most stable identifier you have, but pinning isn't paranoia; it's the minimum bar. Without it, you're trusting the vendor with your application's behavior, which they have not signed up for.
Myth
โIf we always run eval, we don't need version pinningโ
Reality
Eval tells you the score TODAY. Versioning tells you what produced any historical prediction. When a customer escalates a 6-week-old transcript, you need to reproduce it, which requires the exact model version, prompt, and code. Eval and versioning are complementary, not substitutes.
Knowledge Check
Your team's production code references the model as 'gpt-4o' (no date). On Monday, customer complaints spike about a quality drop. You check git: no code changed in 9 days. What is the MOST likely cause and the right immediate action?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Versioning Discipline (Production AI)
Population: enterprise AI workloads using vendor APIs
- Elite (100% coverage): 100% pinned, eval-gated upgrades
- Strong (70-99% coverage): most features pinned, ad-hoc eval
- Average (30-70% coverage): some pinning, no consistent eval
- Weak (< 30% coverage): aliases everywhere
Source: OpenAI / Anthropic versioning best practices
Real-world cases
Companies that lived this.
Case narratives with the numbers that prove (or break) the concept.
OpenAI Model Snapshot Policy
2023-present
OpenAI publishes dated snapshot IDs for every model (gpt-4o-2024-11-20, gpt-4o-2024-08-06, gpt-4-0613, etc.) and provides roughly 6-12 months of deprecation notice before retiring a snapshot. The policy explicitly warns that the un-dated alias 'gpt-4o' may change behavior without notice. Customers who pin to dated snapshots get reproducibility and time to migrate; customers using aliases get 'free' silent updates that often cause regressions. The architecture choice is deliberate: OpenAI provides the tools for stability; using them is the customer's responsibility.
- Snapshot Notice Period: ~6-12 months for deprecation
- Alias Notice Period: 0 (changes silently)
- Affected Customers: anyone using aliases instead of dated IDs
Pin to dated snapshots in production. Aliases are convenience for prototyping, not infrastructure for production.
Hypothetical: Mid-Market SaaS Migration Scramble
2025
Hypothetical: A mid-market SaaS used 'gpt-4-turbo' (no date pin) across 8 production features. When OpenAI deprecated gpt-4-turbo with 90-day notice, the team had no eval infrastructure and no version pins. They spent 6 weeks of engineering time scrambling to migrate to gpt-4o, discovering 3 quality regressions only after customer complaints. Two regressions required prompt rewrites; one required a different model entirely. Total emergency cost: ~$120K in engineering time plus 4 customer escalations. Post-incident, they instituted hard pinning, per-feature eval suites, and a quarterly 'model freshness review' to evaluate new vendor versions on their own schedule.
- Features Affected: 8
- Emergency Engineering Cost: ~$120K
- Customer Escalations: 4
- Time to Recovery: 6 weeks
The cost of not versioning is paid in deprecation cycles. Pin first, eval second, migrate on your schedule.
Beyond the concept
Turn AI Model Versioning into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.