The point of lightning-fast model inference
We're obsessed with generating thousands of tokens a second for a reason—and it isn't just to wow end users with text showing up on a screen really fast.
When ChatGPT first came out, it launched with—and popularized—a fun user experience for getting a message back from the model: having the words print out in sequence as they’re generated.
Tools built on language models don’t have to do that. You could instead wait for the whole response to be generated and then show it all at once. But for one reason or another, the streaming experience largely stuck as the standard interface for apps built on top of language models—even as the speed of token production has climbed so high that you probably can’t even discern it at this point.
But while there’s some threshold where the speed at which tokens are generated stops being visually noticeable or helpful, there’s another reason all this is happening under the hood. These responses generated at lightning speed aren’t built just for humans—they’re built for the bots those models will be talking to in the future.
“I don’t think the interesting work is human-read in the future—the interesting work is machine-read,” Andrew Feldman, CEO and co-founder of Cerebras Systems, told me. “What you’ll see in the future is concatenations of models, where the output of one is the input to the next. That latency stacks. If you wanted to link 6 or 8 of these together, you wait a minute to get an answer. What we know is nobody waits a minute.”
Cerebras Systems, a developer of custom AI chips (that are also colossal), is one of the latest companies to step into this blistering speed race with its new product, Cerebras Inference. Feldman tells me that you’ll get speeds above 1,800 tokens per second on the smaller Llama 3.1 8B model, and 450 tokens per second on the larger Llama 3.1 70B model. And Cerebras runs inference on both models at full 16-bit precision, rather than the compressed 8-bit versions that are often the default option—particularly for the larger models.
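For a rough sense of why that throughput matters, here’s some back-of-envelope math in Python. The 500-token response length and the slower 50-tokens-per-second baseline are assumptions for illustration, not figures from Cerebras or anyone else:

```python
# Back-of-envelope math: how long a single response takes, and how that
# latency stacks across a chain of 8 model calls. The 500-token response
# length and the 50 tokens/sec baseline are illustrative assumptions.
TOKENS_PER_RESPONSE = 500
CHAIN_LENGTH = 8

rates = [
    ("assumed conventional GPU serving", 50),
    ("Llama 3.1 70B on Cerebras Inference", 450),
    ("Llama 3.1 8B on Cerebras Inference", 1800),
]

for label, tokens_per_second in rates:
    per_step = TOKENS_PER_RESPONSE / tokens_per_second
    total = per_step * CHAIN_LENGTH
    print(f"{label}: {per_step:.2f}s per step, {total:.1f}s for the chain")

# assumed conventional GPU serving: 10.00s per step, 80.0s for the chain
# Llama 3.1 70B on Cerebras Inference: 1.11s per step, 8.9s for the chain
# Llama 3.1 8B on Cerebras Inference: 0.28s per step, 2.2s for the chain
```

That 80-second scenario is the “nobody waits a minute” problem Feldman is describing; at the higher rates, the whole chain comes back in seconds.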
This all becomes relevant in the context of the current semi-fever dream of agents: collections of models that can operate more autonomously and solve complex tasks by passing them among one another until they return a result. Each individual model requires its own call, plus a decision on where to route the result, and those calls can hypothetically balloon with the complexity of the task. And all this doesn’t even include the possibility of failures in the chain of operations, such as a prompt rejection (say, because the response returned personally identifiable information).
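Here’s a minimal sketch of what that chaining looks like, assuming a hypothetical call_model client; the StepResult shape and the rejection check are made up for illustration, not taken from any particular framework:

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    text: str
    rejected: bool = False  # e.g. the model refused over a PII concern

def call_model(model: str, prompt: str) -> StepResult:
    # Stand-in for a real inference API call; a production version would
    # hit whichever serving endpoint you use and parse its response.
    return StepResult(text=f"[{model}] processed: {prompt[:60]}")

def run_chain(task: str, models: list[str]) -> str:
    """Feed each model's output into the next one; stop if a step is rejected."""
    current = task
    for model in models:
        result = call_model(model, current)
        if result.rejected:
            # One bad step stalls the whole chain and has to be retried or
            # rerouted -- exactly where per-call latency starts to hurt.
            raise RuntimeError(f"{model} rejected the intermediate output")
        current = result.text  # the output of one is the input to the next
    return current

print(run_chain("Summarize this support ticket and draft a reply.",
                ["classifier", "summarizer", "drafter", "reviewer"]))
```

Every pass through that loop is a full round trip to a model, which is why per-step generation speed compounds so quickly.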
Cerebras Systems is one of a number of companies looking to exploit architectural alternatives to Nvidia’s hardware in order to satisfy a very broad set of use cases. Like Groq and SambaNova Systems, it’s eager to demo the lightning speed of all of these operations, while inference platforms like Together AI and Fireworks also push the envelope on tokens per second.
And while it all moves faster than a human eye can read, this blistering speed is just one of what are likely many prerequisites for building out networks of models that can deliver on some of the potential dream scenarios for language models.
“This new processing power is a game-changer,” Jonathan Corbin, co-founder and CEO of Maven AGI, told me. “It allows AI agents to handle vast datasets in real-time, make more nuanced decisions, and adapt quickly to new information. We view this as crucial for developing AI agents with human-like understanding and responsiveness.”
The case for agents and speed
Right now we’re a considerable way off from creating some kind of one-size-fits-all “agent” that can perform tasks of arbitrary complexity, whether through some model with insane reasoning capabilities or a long, long chain of models strung together.
But there is already a lot of low-hanging fruit that current, off-the-shelf hardware has the potential to pick. More specifically, these language models—particularly the smallest Llama 3.1 8B model—are able to handle compact tasks like classification, summarization, and entity extraction. All of these excel in customer service use cases, particularly for routing problems and determining whether to rope in a human to resolve them.
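As a concrete sketch of that kind of compact task (the label set, the escalation rule, and the classify stub are all assumptions for the example, not any vendor’s actual setup):

```python
# A small-model triage step: classify an incoming ticket, then decide whether
# a human needs to be roped in. classify() is a stub standing in for a call
# to a small model such as Llama 3.1 8B with a fixed label set.

ESCALATE_TO_HUMAN = {"billing_dispute", "account_compromise", "legal"}

def classify(ticket_text: str) -> str:
    # A real version would prompt the model with the ticket and the allowed
    # labels, then parse its one-word answer.
    return "shipping_status"

def route(ticket_text: str) -> str:
    label = classify(ticket_text)
    if label in ESCALATE_TO_HUMAN:
        return f"hand off to a human agent ({label})"
    return f"auto-resolve with a templated workflow ({label})"

print(route("Where is my package? It was supposed to arrive Tuesday."))
# auto-resolve with a templated workflow (shipping_status)
```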
The base case for all of this is having a “robot” complete each of those tasks in isolation, then passing the result on to the next robot for the next incremental step. The classic term for this is, aptly, robotic process automation (or RPA). Except instead of having a bot click around on a website to test something, this process takes in information and determines what to do with it at a much higher level. These are the same types of tasks companies used older language models, like BERT, for—today’s models are just substantially more advanced and can handle more complex problems.
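In code, that handoff is just a sequence of small, single-purpose steps, each one a candidate for a fast small-model call. The step functions below are stubs invented for the example:

```python
# Each "robot" does one narrow job and hands its output to the next:
# extract, summarize, then decide. The stubs stand in for small-model calls.

def extract_entities(ticket: str) -> dict:
    return {"order_id": None, "sentiment": "neutral"}  # stub extraction

def summarize(ticket: str) -> str:
    return ticket[:80]  # stub: a real step would ask a model for a summary

def decide_next_action(entities: dict, summary: str) -> str:
    return "refund_workflow" if "refund" in summary.lower() else "status_lookup"

def process_ticket(ticket: str) -> str:
    entities = extract_entities(ticket)
    summary = summarize(ticket)
    return decide_next_action(entities, summary)

# Batch processing: the same pipeline run over a queue of tickets.
tickets = ["I want a refund for order 1234.", "Has my order shipped yet?"]
print([process_ticket(t) for t in tickets])
# ['refund_workflow', 'status_lookup']
```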
Klarna, the company people reflexively point to when it comes to the potential of dropping AI into customer service, reiterated in an earnings report that AI was performing the work of more than 700 employees and had cut the average resolution time for a customer service case from 11 minutes to just 2 minutes. It might be boring, but batch processing is largely where there’s a lot of value to extract—and waiting less than a fifth as long to resolve a customer service issue seems pretty helpful!