🧪
Knowledge Challenge
A friend thinks you can answer this question about AI Experiment Design
You're rolling out a new LLM that costs 3x more but seems to give better answers in offline eval. What's the right way to validate it in production?