Greedy models and Nvidia's open source blind spot
Nvidia is clearly making an enormous bet on powering high-end foundation models. Where does that leave smaller open source approaches?
Nvidia is once again blowing up the internet as everyone realizes it’s the foundational layer of one of the most important emerging industries in tech in decades.
This is largely thanks to the financial numbers around the explosive growth of its data center division, driven by insatiable demand for its AI-optimized hardware—particularly its H100-series chips. Pre-trained foundation models generally require a lot of compute firepower to train, and some of the larger models need a high level of compute for inference too.
Right now, we’re in what we could probably call the “greedy model” phase of AI, and Nvidia is the single-largest beneficiary of it. “Greedy,” here, isn’t meant as a negative—it just means these models require aggressive levels of compute that only Nvidia can provide, for both training and inference. Modern training frameworks (particularly PyTorch) are reliant on CUDA, Nvidia’s developer framework powered by its hardware.
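To make that CUDA dependency concrete, here’s a minimal PyTorch sketch (the layer sizes are placeholders, not anything from a real model): the framework reaches Nvidia GPUs through its CUDA backend and falls back to much slower CPU execution if none is available.

```python
import torch

# PyTorch exposes Nvidia GPUs through its CUDA backend; if no GPU is
# present, everything runs on the CPU, which is far slower for training.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")

model = torch.nn.Linear(4096, 4096).to(device)  # toy layer as a stand-in for a real model
batch = torch.randn(8, 4096, device=device)
output = model(batch)                           # runs on the GPU if one was found
```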
Greedy models are really the only ones with practical applications in AI at the moment, because they are the only ones that work out of the box. They work well in code generation settings, and many companies are already applying them toward chat experiences. Open source models, while showing a lot of progress, are still unwieldy and difficult to implement—much less maintain.
Nvidia’s success has been largely predicated on AI-accelerated training hardware with price tags in the tens of thousands of dollars—and the larger clouds and model providers cannot get enough of it. Nvidia is essentially the kingmaker of AI, and that position has sent it north of a $1 trillion valuation and made it one of the most valuable companies in the world. Even with smaller models, you can see how much power is required to get them trained up before you even move to inference.
And it isn’t just H100s, either. The massive research cluster Meta announced earlier this year used tens of thousands of A100-series GPUs. Llama 2, which has quickly become the underpinning of most open source models, was trained on A100 clusters. Model developers are taking anything they can get here, and are going to some rather extreme (and sometimes funny) lengths to do so (which we’ll talk about another time).
We don’t have a lot of information about Pi, Claude, and some of the other large models, but what we do see is demand from their developers for H100 clusters in the tens of thousands of nodes. Nvidia is expected to triple production of its H100s, according to a report from the Financial Times.
And it feels like yesterday’s stock bump is really a correction of a general underestimation of the compute required for these greedy models, and less a statement about how “big” AI is or will be in the coming years. AI is already “big” in terms of cultural impact, but it hasn’t arrived at its killer-app moment yet, either.
One thing that doesn’t seem to be particularly on Nvidia’s mind, if we’re zeroing in on the comments it made in this report (which, as usual, is all we have to go on), is smaller models—particularly those in open source that are readily available for fine-tuning and deployment on decidedly non-H100-grade hardware.
A blind spot in inference and pocket models
One of the first questions that came in here was from an analyst asking about the inference market. Nvidia CEO Jensen Huang’s answer was kind of dodgy and meandering, focusing on large models. I’m just going to drop the full answer in here, with some emphasis added:
These large language models are -- are fairly -- are pretty phenomenal. It -- it does several things, of course.
It has the ability to understand unstructured language. But at its core, what it has learned is the structure of human language and it has encoded or within it -- compressed within it a large amount of human knowledge that it has learned by the corpuses that it studied. What happens is you create these large language models and you create as large as you can, and then you derive from it smaller versions of the model, essentially teacher-student models. It's a process called distillation. And so, when you see these smaller models, it's very likely the case that they were derived from or distilled from or learned from larger models, and just as you have professors and teachers and students and so on, so forth.
And you're going to see this going forward. And so, you start from a very large model and it has to build — it has a large amount of generality and generalization and — and what's called zero-shot capability. And so, for a lot of applications and questions or skills that you haven't trained it specifically on, these large language models, miraculously, has the capability to perform them. That's what makes it so magical. On the other hand, you would like to have these capabilities in all kinds of computing devices.
And so, what you do is you distill them down. These smaller models might have excellent capabilities in a particular skill, but they don't generalize as well. They don't have what is called as-good zero-shot capabilities. And so, they all have their own -- own unique capabilities, but -- but you start from very large models.
(As an aside, this is weirdly one of the few very public mentions I’ve seen of distillation, and it very rarely comes up in conversations. You hear a lot about quantization, low-rank adaptation, retrieval-augmented generation (RAG), and other techniques for building out performant smaller models. There’s some work around using GPT-series models to generate instructions for smaller models, but that’s really the extent of how it comes up.)
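Since distillation comes up so rarely, it’s worth sketching how simple the core mechanic is. This is a minimal, generic PyTorch version of the teacher-student setup Huang describes—not anything tied to Nvidia’s stack; `teacher` and `student` are hypothetical stand-ins for a large model and the smaller one being trained, and both are assumed to produce logits over the same vocabulary.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target distillation: push the student toward the teacher's
    softened output distribution via KL divergence."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2

def train_step(teacher, student, optimizer, batch):
    # The big model only runs inference; the small model is what gets updated.
    with torch.no_grad():
        teacher_logits = teacher(batch)
    student_logits = student(batch)
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice this soft-target loss is usually mixed with an ordinary cross-entropy term against ground-truth labels, but the teacher-supervises-student coupling above is the part Huang is describing.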
If we’re reading between the lines a little bit here, Huang is basically making a bet on both the training and the inference of greedy models, which require the firepower of Nvidia’s more powerful GPUs. It’s also a bet that modern AI development will, at minimum, build on top of those greedy models trained on premium Nvidia hardware. (SemiAnalysis’ epic teardown of GPT-4 reveals just how much power was required for training, and even for a single inference pass, across a large mixture-of-experts model.)