Please send OpenAI your good vibes as it dives into model evaluation
OpenAI has decided it's time to try to handle one of AI's existential crises. Maybe this time it'll work?
While OpenAI’s advanced o1 “stop and think for a second” models may be a technical feat, the company is also increasingly working to steer developers onto its significantly cheaper workhorse models.
As part of this, OpenAI has to wade into one of modern AI’s comically difficult problems: evaluating whether a model’s response is “good” in some quantitative and automated way. The industry has long tried to graduate out of vibes-based evaluation, but so far there’s no obviously good solution. And yet, here we are, with OpenAI trying to do exactly that as part of its new distillation tool.
Its distillation tool is one of a series of new developer tools OpenAI unveiled at its DevDay. The first, which has a much bigger wow factor, is the availability of its real-time advanced voice assistant as part of a broader “real-time” API. But there are also additional features that focus on mitigating usage costs and increasing performance without OpenAI having to cut the list prices of its APIs.
Its distillation tool effectively uses larger models to fine-tune a smaller one (like GPT-4o mini). But OpenAI is trying to wrap it into one kind of “productized” package that streamlines the process. Developers can use a stored completions feature for exactly what it sounds like, and manage the evaluation and distillation directly through a user interface rather than integrating some complex suite.
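For a rough sense of what that looks like on the API side, here’s a minimal sketch using the OpenAI Python SDK. It assumes the `store` and `metadata` parameters behave the way OpenAI described them at DevDay; the model names, tags, and prompts are made up for illustration.

```python
from openai import OpenAI

client = OpenAI()

# Call the larger "teacher" model and ask OpenAI to keep the completion around,
# so it can later feed the distillation dataset for a smaller model.
response = client.chat.completions.create(
    model="gpt-4o",  # the big model whose behavior you want to distill
    messages=[
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    store=True,  # persist this exchange as a stored completion
    metadata={"use_case": "support-faq"},  # tag it so it's easy to filter later
)

print(response.choices[0].message.content)
```

From there, the dataset assembly, evaluation, and fine-tuning of the smaller model all happen in the dashboard rather than in code.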
OpenAI is also releasing a prompt caching tool that cuts costs for content repeated across requests within a context window. Added together with its distillation suite, the updates are a roundabout way of shaving down what developers have to pay to deliver a high-performance AI-powered product.
OpenAI has long run a playbook of being extremely convenient from both a usage and a procurement standpoint, even if you incur a cost penalty by ignoring the alternatives. Enterprises, though, have become much more concerned with costs (OpenAI’s in particular). As a result, OpenAI’s convenience advantage has started to look a bit more brittle in the face of a long list of alternative and highly competitive products.
So the next step is clearly to find some way to add different layers of efficiency to ease the cost burden on developers and enterprises. OpenAI also has to find a way to lower the difficulty of some of the more advanced techniques for getting costs under control. That means building a product that wades into everyone’s favorite existential crisis of a development challenge: model evaluations.
The realtime API is cool as hell and is going to be useful for a lot of potential product cases. But, setting that aside for now, let’s get to the more in-the-weeds bits: how OpenAI is going to shave off costs at a time when it’s under pressure from all directions by competing (and cheaper) products.
Getting into the mess that is model evals
There are two direct routes OpenAI is now deploying to find an optimal point between performance and value, for both developers and for OpenAI. One is to collect as many efficiencies as possible while a model is actually in use. The other is to funnel as many people as possible to a workhorse model, like Gemini Flash or GPT-4o mini.
Its prompt caching tool, which offers a 50% discount and faster prompt processing on input tokens the model has already seen recently, is playing a little bit of catch-up. This is becoming a pretty popular feature for multi-turn products that keep a lot of information stuffed into a context window, such as large documents, code bases, long customer service interactions, or tools that require more advanced retrieval.
This feature was already available for Claude and Gemini, though in different flavors. Anthropic charges slightly more to write to a cache, but input tokens read from that cache cost 10% of normal input tokens. Anthropic’s prompt cache expires after five minutes, and that clock restarts every time the cache is used. Gemini’s context caching, meanwhile, also offers a substantial discount off normal input prices and charges an hourly rate per million tokens cached (with no fixed limit on how long the cache can live).
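For comparison’s sake, here’s roughly what Anthropic’s explicit opt-in looks like in its Python SDK: a `cache_control` marker on the big, stable block you want cached. The model name, document, and beta header below are illustrative and may vary by SDK version.

```python
import anthropic

client = anthropic.Anthropic()

long_document = open("manual.txt").read()  # the large, stable context worth caching

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_document,
            # Explicitly mark this block as cacheable; later calls that reuse the
            # same prefix read it back at the discounted rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the troubleshooting section."}],
    # Older SDK versions gated caching behind a beta header; newer ones may not need it.
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
)

print(message.content[0].text)
```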
OpenAI, in its bid to simplify the experience as much as possible, basically sweeps all that complexity under the rug and applies the discount and performance bump whenever the cache window is active. As of today, it’s automatically applied to its latest models, with developers not really having to do anything.
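In practice, the only integration work left is prompt hygiene: keep the long, unchanging content at the front of the prompt so repeated calls share a prefix the cache can actually hit. A quick sketch, with the file name and questions made up for illustration:

```python
from openai import OpenAI

client = OpenAI()

static_context = open("product_manual.txt").read()  # big document reused on every turn

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            # Identical prefix across calls, so it's eligible for the cached-input discount.
            {"role": "system", "content": "Answer using only the manual below.\n\n" + static_context},
            # The part that changes goes last, so it doesn't break the shared prefix.
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("How do I pair the device over Bluetooth?"))
print(ask("What does a blinking red light mean?"))  # this call can hit the prompt cache
```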
OpenAI’s distillation “suite” includes letting developers store completions in order to build a large dataset for fine-tuning its smaller models. It also includes that evaluation tool, in which users can run a bunch of off-the-shelf evaluations (like groundedness) or design their own. That way, enterprises can pull in their custom evaluations and link them together directly within a single OpenAI interface, right on top of stored completions.
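OpenAI hasn’t said much about how its built-in graders work under the hood, but a groundedness check of this sort typically boils down to an LLM-as-judge pass over stored completions. The sketch below is a generic illustration of that idea rather than OpenAI’s actual grader, and every name and example in it is hypothetical.

```python
from openai import OpenAI

client = OpenAI()

def is_grounded(source: str, answer: str) -> bool:
    """Ask a judge model whether every claim in the answer is supported by the source."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Source:\n{source}\n\nAnswer:\n{answer}\n\n"
                "Is every claim in the answer supported by the source? Reply YES or NO."
            ),
        }],
    ).choices[0].message.content.strip().upper()
    return verdict.startswith("YES")

# Score a handful of (source, answer) pairs pulled from stored completions.
examples = [
    ("Our return window is 30 days from delivery.", "You can return items within 30 days of delivery."),
    ("Our return window is 30 days from delivery.", "Returns are accepted for up to a year."),
]
passed = sum(is_grounded(source, answer) for source, answer in examples)
print(f"{passed}/{len(examples)} answers grounded")
```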