How the GPU crunch is forcing AI developers to adapt
Nvidia's long game to own AI development has finally paid off, and it now wields an enormous amount of power. How are smaller developers trying to get around the scarcity of GPUs?
This past week, Weights & Biases held a conference for more than 1,000 attendees in the machine learning space. It announced several new products, including Weave, an open source Python framework for managing machine learning data apps.
But at the conference, one subject seemed to be on everyone's mind: the Nvidia GPU crunch.
Nvidia’s hardware, particularly its A100 and H100 series, has become the standard for training and running inference on large machine learning models. That’s partly because of PyTorch’s dependence on CUDA, which is tied to Nvidia hardware. PyTorch has become the dominant machine learning framework, and it has carried Nvidia along with it. As a result, Nvidia can’t keep up with demand.
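To illustrate that coupling, here's a minimal sketch of the pattern most PyTorch training code follows (using the standard torch.cuda API, not any one company's setup): the code asks for a "cuda" device when one is present, which in practice means an Nvidia GPU.

```python
import torch

# PyTorch's GPU path is CUDA-first: "cuda" here effectively means "an Nvidia GPU is attached."
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A tiny model and batch, moved onto whatever accelerator is available.
model = torch.nn.Linear(512, 512).to(device)
batch = torch.randn(32, 512, device=device)

output = model(batch)
print(f"Ran forward pass on: {output.device}")
```

When no Nvidia GPU is available, that same code silently falls back to CPU, which is exactly the trade-off developers are wrestling with as the hardware gets harder to obtain.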
The investing public only discovered this quarter that Nvidia, several years into building AI hardware, is an AI hardware maker, rocketing it to a near-trillion-dollar valuation. But its dominance in AI has had all of the biggest machine learning developers clamoring for its hardware for some time now, particularly its new H100 series, for which I’ve heard companies like Inflection are assembling clusters as large as 10,000 nodes.
The Information also reported in March that Microsoft was rationing access to A100s for internal teams. I’ve even heard that Google’s TPUs are harder for Googlers to snag internally these days, even though they historically had ready access to them.
This scarcity of hardware—particularly Nvidia hardware—has significant ramifications for the machine learning ecosystem as the large foundation model developers, like Inflection and OpenAI, fight for access to it. And it’s led smaller developers to get much more creative in the way they develop and deploy models.