The point of endpoints
Startups have a new way to tackle idle GPUs: flooding them with tokens. Plus: Apple's potential license for Google's Gemini.
Because we spent last week catching up on a whole lot of reporting debt, all paid subscribers received another comped week to make up for the lost issues.
Double feature today! First, we’ll (for free subscribers) address the potential Apple/Google partnership. Then we’ll go into some of the reasoning behind the sudden proliferation of endpoints serving Mistral models from all these companies.
Apple seems to be repeating history with Google
Apple’s bizarre reckoning with generative AI seems like it’s about to take another very weird—but familiar—turn.
Mark Gurman over at Bloomberg is reporting that Apple is in conversations with Google about handing over some of the heavy lifting for generative AI to Gemini. This comes at a time when Apple is making some peculiar, if a little tangential, releases in AI—particularly in the form of its MLX development framework and its recent detailing of its own model.
While Apple has been trying to make this work, it seems to be running into a wall, simply because we haven’t seen anything from it yet and we’re well past the one-year mark since the launch of ChatGPT. Siri is still a terrible product, and the best we’ve gotten so far is some better updates to Autocorrect and promises that more is on the way.
This wouldn’t be the first time Apple has thrown up its hands in a space it can’t figure out and decided to just play nice with Google for now.
In 2012, Apple announced that it would be launching its own native Maps app, replacing the one historically powered by Google. It went really poorly, and Apple allowed other maps apps like Google Maps and Waze onto the App Store. Apple’s famous Maps disaster essentially allowed Google to become a core part of the iPhone with its own apps while Apple scrambled to play catch-up with its own core options.
But despite effectively admitting failure and allowing users to download Google Maps because its own product was terrible, Apple continued to aggressively invest in maps. Pretty much every annual WWDC comes with a Maps announcement of some form improving the data and fidelity. Apple also effectively handed over search to Google, both initially and currently, though it continues to invest in its own quasi-search product with Spotlight.
And while its product(s) might still be inferior to Google Maps (which is kind of a subjective thing), the investment in Maps has fit neatly into its marketing: you aren’t necessarily dealing with an ad-supported Maps product but rather a premium feature on a premium phone.
The extension here feels pretty natural: Apple already has a partnership with Google to make it the default search engine in Safari, and AI has the potential to blow up search altogether. The Bloomberg report notes that the partnership would be for a service to do the “heavy lifting”—as in, something that its own silicon couldn’t handle.
And, to be clear, Apple’s silicon can handle a surprising amount. One project—MLC LLM—can get (much) smaller models running locally on iOS. Advancements in shrinking models to run on less powerful devices (a process called quantization, which trades numerical precision for a smaller footprint) are only going to improve the experience.
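To make that a bit more concrete, here’s a minimal sketch of the idea behind quantization (not how MLC LLM or MLX actually implement it, just the back-of-the-napkin version): store weights as 8-bit integers plus a scale factor instead of 32-bit floats, and accept a little error in exchange for a roughly 4x smaller footprint.

```python
# Minimal sketch of post-training weight quantization. This is illustrative
# only, not how MLC LLM or Apple's MLX actually implement it.
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric int8 quantization: weights ~= scale * q."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)  # one toy weight matrix
q, scale = quantize_int8(w)

print(f"fp32 size: {w.nbytes / 1e6:.0f} MB, int8 size: {q.nbytes / 1e6:.0f} MB")
print(f"mean abs reconstruction error: {np.abs(w - dequantize(q, scale)).mean():.5f}")
```

Real deployments use fancier schemes (per-channel scales, 4-bit formats, calibration data), but the tradeoff is the same: less memory and bandwidth per weight, slightly noisier math.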
But Apple can’t just sit there and do nothing while it figures out how to effectively deploy diffusion and language models. And working with Google (or OpenAI or others) offers a kind of stopgap to capture the upper end of use cases like generating images or managing sophisticated prompts with quality requirements.
One of the challenges Apple has likely faced stems from one of the iPhone’s core value props: privacy. The pitch is that data stays on your phone and everything happens locally rather than phoning home. That might be working against Apple while Meta and Google hoover up a ton of data to improve their products.
It feels like there are a few directions that this deal could go in:
Apple licenses the weights for Gemini. This is the kind of boring, bare-bones version of a partnership where Apple gets access to some set of the core Gemini models that it could host itself. Google has its own pocket version of Gemini, Gemini Nano, designed to live on-device. This feels a little less likely if Apple is also talking to OpenAI, as the Bloomberg report suggests, because it could also just follow Databricks and company and use Mistral models.
Apple integrates more “advanced” products where needed, with sufficient disclaimers for users. Search is obviously the best example—users know what they are getting into when searching with Google, but accept the tradeoff to get the results they need.
One thing Apple does have going for it, though, is that the actual killer use cases for AI aren’t fully baked just yet. Meta has image generation in its chat apps (where the expectation of privacy is a little… low), but it’s not clear if that’s completely altered the way users act within messaging apps.
But Apple’s users want to see something happening in AI, because it’s happening everywhere else. And while Apple released the Vision Pro (to mixed reviews), it is still trying to show it’s an innovative company that can respond to user demands. Apple is historically slow to adapt to user preferences, as with the adoption of 5G and building larger phones.
But if it turns out that AI is indeed a once-in-a-generation technology that will redefine the way we use devices (like location services with Maps in 2012), it has to figure out something in the meantime.
And why not work with an established partner while the industry is still very much in its infancy, as you figure your stuff out at home?
Why endpoints are showing up everywhere
Toward the end of last year and early this year, we saw a flurry of activity around companies throwing up APIs for open source models as alternatives to OpenAI and other large foundation model providers.
The spectrum ranges from companies dedicated to inference (like Replicate) and fine-tuning and serving (Together AI) all the way to next-generation search engine Perplexity’s Labs product. In some cases it might feel a bit odd that there would be an inference endpoint available at all, particularly as abstraction layers like Snowflake and Databricks pounced on the opportunity in the last week or so.
But while these endpoints hypothetically position their creators as OpenAI alternatives of a sort, another use I hear about a lot more often these days from those in the industry serves a much more utilitarian purpose: tapping unused compute.
Getting your hands on these GPUs requires either a direct upfront investment in the hardware—which could cost tens of thousands of dollars—or reserving a spot and paying per GPU hour. And while those GPUs aren’t doing anything, it’s basically a lost opportunity to generate revenue. The faster you recoup the cost of an H100, the faster you’re generating value out of it, and the gap between current and full utilization on a running GPU is simply empty space that could be filled and generating value.
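Some rough, entirely hypothetical math shows why that empty space stings. None of the numbers below come from an actual provider; they’re placeholders to show the shape of the problem.

```python
# Back-of-the-envelope math on idle GPU hours. Every number below is a
# hypothetical placeholder, not a quote from any vendor or provider.
H100_COST = 30_000          # assumed upfront hardware cost, USD
TOKENS_PER_SECOND = 1_500   # assumed aggregate serving throughput
PRICE_PER_M_TOKENS = 0.60   # assumed price charged per million tokens, USD

revenue_per_hour = TOKENS_PER_SECOND * 3600 / 1e6 * PRICE_PER_M_TOKENS
hours_to_break_even = H100_COST / revenue_per_hour

print(f"revenue per fully utilized hour: ${revenue_per_hour:.2f}")
print(f"hours at full utilization to recoup the card: {hours_to_break_even:,.0f}")
print(f"roughly {hours_to_break_even / (24 * 365):.1f} years at 100% utilization")
# Every idle hour pushes that break-even point further out. That gap is the
# "empty space" these endpoints are trying to fill.
```

Even under these toy numbers the payback period is measured in years at perfect utilization, which is exactly why every idle hour hurts.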
But like any growing business, you want to acquire capacity ahead of the usage arriving. Not doing so runs the risk of collapsing under the weight of your own success and users fleeing to another product—at a time when, for any given use case, there are probably a number of companies going after the same thing.
One solution, then, is to flood that empty space with tokens. These endpoints have quickly emerged as a rather effective way of filling that empty space, recouping the cost of the GPU, and getting to value faster while you wait for the moments you really need it, such as a training run or an extreme spike in usage.
The release of all these endpoints offers an increasing smorgasbord of API options outside of OpenAI and its neighbors—often at much cheaper costs per million tokens. Together AI offers a whole suite of highly performant (and cheap) open source models for both LLM inference and embeddings, bringing it, oddly, into more direct competition with OpenAI.
The proliferation of these endpoints to use up all that empty space has effectively triggered a race to the bottom for zero-shot LLM-via-API pricing, with Mistral’s Mixtral model being the primary trigger. While Snowflake and Databricks, as of this week (more on that in a second), support Mixtral, it’s companies like Together AI and Fireworks that are pushing for the lowest possible cost.
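In practice, trying one of these endpoints looks a lot like talking to OpenAI: many providers expose OpenAI-compatible APIs, so the switch is mostly a base URL and a model name. The URL below is a placeholder and the model ID is only illustrative; your provider’s docs have the real values.

```python
# Sketch of swapping an OpenAI call for an open-model endpoint. Assumes the
# provider exposes an OpenAI-compatible API; the base_url and model name are
# placeholders, not verified values for any particular company.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-inference-provider.com/v1",  # hypothetical provider URL
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # an open model instead of a GPT-4-class one
    messages=[{"role": "user", "content": "Why are idle GPUs so expensive?"}],
)

print(response.choices[0].message.content)
```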
Having those endpoints available also broadens the surface area for companies like Fireworks or Together AI to get additional users signing up for their tools. Having more generalized products provides a nice way to build up consumption while creating a ramp toward becoming more of a direct infrastructure company.
And once a customer runs into a wall with the raw per-token API, these companies can make a strong pitch to onboard them onto more advanced products—such as higher performance, better availability, or the ability to customize models through fine-tuning.
There are plenty of other incentives for releasing all these endpoints beyond just building up brand awareness, offering competitive products to OpenAI, and opening up new potential revenue streams. But at its core it’s tackling one of the biggest headaches for any company that somehow got its hands on a piece of Nvidia hardware: making sure it’s actually in use.
The challenge of idle GPUs
One of the bigger headaches for companies that need access to cutting-edge GPUs in short supply is that they effectively have to either acquire the hardware themselves or “save” their spot by reserving capacity, or else run the risk of not having the GPUs when they need them the most.