Knowledge Challenge

A friend thinks you can answer this question about AI Experiment Design

You're rolling out a new LLM that costs 3x more but seems to give better answers in offline evaluation. What's the right way to validate it in production?