AI in January: endpoints everywhere, the race to the bottom, and models on the edge
Plus: Another odd AI leak over the weekend.
Supervised will be on a limited publishing schedule until 2/26 as part of a partial leave. Paid subscribers, including new ones who sign up within the next three weeks, will receive a comped week's subscription. When I resume a full publishing schedule in the last week of February, I'll comp existing paid subscribers a duration proportional to the number of missed issues (one week per every two issues).
The first month of 2024 is a wrap, and the year is already shaping up to be a hectic one: it's started with open source models gaining an enormous amount of ground on OpenAI.
Broadly, the year seems to be pointing toward major developments in two directions: open source models gaining more prominence through endpoints, and those same models (and potentially closed-source ones) operating closer to edge devices. On the latter front, enough has happened that it's already technically possible, if still a little unwieldy.
The former, however, has effectively triggered what we all expected as open source models catch up with proprietary ones and more competitors arise in the closed-source API ecosystem: a rush to push prices as close to zero as possible.
The immense pressure on API pricing more or less started with the release of Mistral's Mixtral mixture-of-experts model last year, which could be served at a substantially lower cost than OpenAI's GPT-3.5 Turbo. But developments in other APIs, such as embeddings, are now squeezing OpenAI (and others) on pricing as well.
So that's part of what we'll dig into in this month's recap, including one trend that's still on the horizon: what distributed inference might look like, and how databases and cloud providers fit into it.
Here are the three broad strokes from January:
The proliferation of AI endpoints. Inference startups like Replicate already existed for some models. But now a lot of companies that weren't built on inference are getting into it, and not just on the LLM front. That became more evident with the launch of a new type of model from Together AI.
The race to the bottom for AI is getting started. While it was expected to happen at some point, we're starting to see one-upmanship emerge between companies on pricing for their endpoints. Price cuts are everywhere, with one startup cutting its prices just a day after OpenAI announced cuts of its own.
Local edge models coalesce into a paradigm for distributed AI computing, along with some familiar headaches. Models are moving close enough to edge devices that you can run a pretty good LLM on a laptop or a weaker Nvidia card (see the sketch after this list). The next obvious step is going in the opposite direction: pushing those results back to an end product. And the tools for that are starting to come together.
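To make the laptop-inference point concrete, here's a minimal sketch using the llama-cpp-python bindings. It assumes you've already downloaded a GGUF-quantized model; the file path and model choice below are placeholders, not a recommendation.

```python
# Minimal local-inference sketch using llama-cpp-python
# (pip install llama-cpp-python). The model path is a placeholder:
# any GGUF-quantized model you've downloaded locally will work.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to a GPU if one is available
)

out = llm(
    "Q: Why are inference costs falling? A:",
    max_tokens=128,
    stop=["Q:"],  # stop before the model starts a new question
)
print(out["choices"][0]["text"])
```

A 4-bit quantization like Q4_K_M is what makes a 7B-parameter model comfortable on a laptop: the weights fit in a few gigabytes of RAM, at a modest cost in output quality.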
With that, let’s get started with what was obviously coming—though maybe not as soon as we would have anticipated.
The race to the bottom for AI pricing
While GPT-4 is generally still considered the best one-size-fits-all model, others, particularly in open source, are rapidly catching up to OpenAI's faster, cheaper GPT-3.5 Turbo. The release of Llama 2 70B gave us a model roughly in that range, while Mistral's Mixtral mixture-of-experts model gave GPT-3.5 Turbo direct competition.
All that jockeying is finally triggering what we've long expected: the cost of inference racing toward zero. And there are a handful of startups, particularly Together AI, that we can thank for it.
In fact, it seems increasingly clear that Together AI (one of Chris Ré's many startups) is becoming one of OpenAI's biggest potential rivals despite being a bit over a year old. It can ride the open source wave, pick off the best available models (LLMs or otherwise), and serve them as endpoints, as well as provide fine-tuning expertise. The feedback on performance I hear from users of its endpoints is generally very positive.
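For a sense of what "serving them as endpoints" looks like in practice, here's a minimal sketch against Together's OpenAI-compatible API. The model ID is an example and may differ from what's currently in their catalog; the API key is a placeholder.

```python
# Sketch of calling a hosted open-model endpoint through an
# OpenAI-compatible API. Together AI exposes one at api.together.xyz/v1.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="YOUR_TOGETHER_API_KEY",  # placeholder
)

resp = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # example model ID
    messages=[
        {"role": "user", "content": "In one sentence, what is a mixture-of-experts model?"}
    ],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```

That OpenAI-compatible surface is part of why switching costs are so low, and why pricing pressure transmits so quickly: swapping providers is often just a change of base URL and model ID.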
We can look at how things have evolved over just the last few quarters and see that, broadly speaking, prices for models in the general range of OpenAI’s GPT-3.5 Turbo have continued to drop—including a price drop from OpenAI itself. (The shaded area in the chart represents the cost of input tokens for each designated model, as some providers have a separate pricing structure for input vs. output.)
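To make the input/output split concrete, here's a small sketch of how per-request cost works out under that kind of pricing. The rates below are illustrative placeholders, not any provider's actual prices.

```python
# Illustrative sketch of split input/output token pricing.
# The rates here are placeholders, not any provider's actual prices.

PRICES_PER_MILLION = {
    # model: (input $/1M tokens, output $/1M tokens) -- hypothetical numbers
    "model-a": (0.50, 1.50),
    "model-b": (0.27, 0.27),  # some providers charge a single flat rate
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request under split pricing."""
    input_rate, output_rate = PRICES_PER_MILLION[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Chat-style requests are usually input-heavy: long prompt, short reply,
# which is why the input rate (the shaded area) matters so much.
print(f"${request_cost('model-a', input_tokens=3_000, output_tokens=300):.6f}")
print(f"${request_cost('model-b', input_tokens=3_000, output_tokens=300):.6f}")
```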