OpenAI (sort of) covers one of its blind spots
OpenAI now has a batch processing API. But this time around, it's dealing with more than just a handful of startups: it's up against the likes of Snowflake and Databricks.
Author’s note: Due to the busy nature of the week with multiple events and the launch of Llama 3, Friday’s column will be moved to early next week and will be a double issue.
OpenAI is filling in another missing piece of its toolkit, one that other providers were quickly running off with. And this time it addresses one of the most important pathways to enterprises actually putting language models in production.
Last week I’d noted that most companies I’ve been talking to lately have been using modern language models as part of a batch data processing workflow. OpenAI this week announced the release of its batch processing API, which enables users to process large amounts of data on a less urgent timeline; in exchange, you get higher rate limits and half off the standard API price. (And it looks like OpenAI has finally adopted the per-million token pricing structure instead of per-thousand.)
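For context, the shape of the workflow is: bundle your requests into a file, submit them as a single batch job, and come back for the results within a 24-hour window. Here’s a minimal sketch using OpenAI’s Python SDK; the file names, prompts, and ticket text are illustrative, and it assumes an API key is already configured in your environment.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Write requests to a JSONL file, one chat completion per line.
#    The tickets and prompt here are purely illustrative.
tickets = ["My order never arrived.", "Great product, but setup took hours."]
with open("batch_requests.jsonl", "w") as f:
    for i, ticket in enumerate(tickets):
        request = {
            "custom_id": f"ticket-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-3.5-turbo",
                "messages": [
                    {"role": "system", "content": "Classify the sentiment of this support ticket."},
                    {"role": "user", "content": ticket},
                ],
            },
        }
        f.write(json.dumps(request) + "\n")

# 2. Upload the file and kick off the batch job, which runs on the slower
#    completion window at the discounted batch rate.
batch_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll later with client.batches.retrieve(batch.id)
```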
Companies process large amounts of unstructured data, such as customer support tickets, running simple tasks like classification, sentiment analysis, or summarization on a longer timeline, typically hours (or even days), using smaller models like Mistral 7B or Llama 13B. It’s not so dissimilar to what companies were already doing with BERT, an early open source language model, only more efficient and effective.
OpenAI, the first mover in making language models broadly available through its GPT-series APIs, had effectively sat on the sidelines while companies like Snowflake and Databricks, which host these smaller open source models adjacent to company data, built up what increasingly seems to be the standard use of language models in enterprises. While you could fire up a smaller Mistral model on top of your data, GPT-3.5 Turbo was still $0.50 per million input tokens and $1.50 per million output tokens after its latest price cut.
In fact, when you talk to partners, platforms, or the enterprises themselves, batch data processing is the clearest case of “AI in prod” you’re going to get right now. The whole dream of some mythical autonomous agent replacing entire teams has largely faded into the background, and in its place is a bunch of pretty generic and boring use cases that actually return a disproportionate amount of value.
As is the case with many of the modalities and business models where OpenAI faces increasing competition from startups (and, somewhat less obviously, platform providers), it’s adding the API to its repertoire in a kind of better-late-than-never approach. But its own “flavor” here is that it’s offering cheaper versions of its portfolio of models, rather than just one or two smaller ones.
There are plenty of reasons why OpenAI would want to do this, not least to keep up with potential customers turning to others for cheaper batch data processing. But it also offers an interesting opportunity to address another emerging challenge for companies that manage hordes of GPUs: getting as close to maximum utilization as possible.
This time around, though, OpenAI isn’t just dealing with a handful of startups trying to eat away at its cost advantage on the edges. It has to contend with the fact that the enterprises doing batch data processing are the same ones that probably already have their data stored on a platform that offers it in some fashion.
The appeal of batch data processing
When a request that’s uniquely suited to a language model isn’t particularly urgent (unlike, say, an angry customer on the line), it turns out you have a lot more language model options at your disposal. You could instead comb through all of those angry customer calls with a less powerful model, extracting basic insights to understand why your product might be broken in the first place. A model with GPT-4-level quality isn’t really necessary, nor would it be particularly practical from a cost perspective.
A lot of these problems end up getting routed through call centers because they have a kind of "eye test" feel to them, where you couldn't just directly extract the result with a standard machine learning model. Companies like Snowflake and Databricks have been implementing access to open source models like Mistral's, providing those kinds of tools right on top of a company's pool of data.
Realistically, a company could upgrade to a more powerful model for batch processing at some multiple of the price of a smaller model like one from Mistral or Meta (or any of its openly available fine-tuned offshoots). OpenAI’s batch pricing brings the cost more in line with what a company would be looking at for a Mixtral-type model. And, again, the benefit OpenAI has, broadly speaking, is that it’s just incredibly easy to use.
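To put rough numbers on that, here’s some back-of-the-envelope math using the GPT-3.5 Turbo prices mentioned above and the 50% batch discount. The token volumes are made up purely for illustration.

```python
# Standard GPT-3.5 Turbo pricing, in dollars per million tokens.
standard = {"input": 0.50, "output": 1.50}

# The batch API takes 50% off those rates.
batch = {k: round(v * 0.5, 2) for k, v in standard.items()}
print(batch)  # {'input': 0.25, 'output': 0.75}

# Hypothetical monthly workload: 100M input tokens, 20M output tokens.
input_millions, output_millions = 100, 20
cost = input_millions * batch["input"] + output_millions * batch["output"]
print(f"${cost:.2f}")  # $40.00 at batch rates, versus $80.00 at standard rates
```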
So we’re going to look at two different price breakdowns—one that feels a smidge closer to reality for enterprises with batch processing already in place that are looking at using smaller (especially customized) models, and one that probably better reflects the reality of people that might use the OpenAI batch API.
The usual caveat here: most of the value of these smaller models is unlocked through customization. So a raw endpoint-to-endpoint price comparison is a little fuzzy, as fine-tuned versions have to be hosted and come with different pricing principles (such as per-GPU-hour pricing). OpenAI does offer fine-tuned GPT-3.5 Turbo, but it’s on the order of $3 per million input tokens and $6 per million output tokens.
But we’re working with what we have, and the smaller models are already somewhat capable of handling some of these hyper-specific workflow tasks (like less-complicated summarization) before fine-tuning. Let’s start by comparing some Mistral 7B endpoints on other providers to OpenAI’s GPT-3.5 Turbo at a 50% discount:
As you’d expect, the pricing here doesn’t exactly work in OpenAI’s favor. There’s certainly an upper bound to the performance of these models, and there’s some additional complexity that comes with hosting a fine-tuned version, which would bring the performance for specific tasks more in line with what a GPT-series model from OpenAI can do.
OpenAI, however, is probably looking for a somewhat more favorable comparison here, with GPT-3.5 Turbo’s quality more in line with that of Mixtral or Claude Haiku. If we were to look at a comparison amongst those models…