Pocket falcons and a quantized future

With the release of Falcon 180B, we're starting to see a unique realm of opportunity for squishing massive models to run on weaker devices.

Sep 07, 2023


A large MacBook Pro sitting on a cluttered desk with an image of a falcon on the screen, angled photo

Author’s note: Friday’s issue will be coming out next week so I can catch up on some reporting and data analysis. Thanks everyone for your patience, and see you at Dreamforce!

The fall frenzy for AI feels like it’s kicking into a higher gear slightly ahead of schedule, this time with the release of another enormous available-ish model on Hugging Face.

Falcon 180B, a model from the Technology Innovation Institute (TII) that is considerably larger than its earlier 40B model, popped up this week on Hugging Face after finishing up a training run that started at the same time as its smaller sibling. But one really cool project sprang up around it almost immediately: compressed versions of it that run on less powerful hardware, made through a technique called quantization.

Shortly after the release of Falcon 180B (which came with a “we’re topping the leaderboard” announcement like the last one), a quantized version of Falcon 180B from Tom Jobbins, also known on Hugging Face as TheBloke, hit the Hugging Face hub. And GGML founder Georgi Gerganov, who is behind the llama.cpp project to run these types of shrunken models locally, seems to already have it running on a Mac with an M2 Ultra chip.

Most Hugging Face users are going to be familiar with Jobbins’ work, particularly those who follow the Open LLM Leaderboard at all. Jobbins uploads suites of popular models in a dizzying array of configurations. Quantized versions of Llama 2 models showed up on Hugging Face pretty much faster than you could blink.

I’ve written before about how Apple has a really unique opportunity to become a powerful (if niche) player in open source model development, and it feels like that starts with the emergence and growth of quantization. llama.cpp helps developers get compact versions of considerably larger models running locally. And Apple’s hardware is quite beefy, in particular offering comically large amounts of memory (up to 96GB on a fully loaded MacBook Pro and 192GB on a Mac Studio).
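To put those memory numbers in context, here is a back-of-the-envelope sketch of how much space the weights alone of a 180-billion-parameter model take up at different precisions. The parameter count is taken from the model's name, the bit widths are illustrative assumptions, and the function name is made up for this example. Real quantized files carry some extra overhead, and running a model needs memory beyond the weights, so treat these as rough lower bounds rather than measurements.

```python
# Rough, weights-only estimate. Ignores the KV cache, activations,
# and any per-block overhead that real quantized formats add.
PARAMS = 180e9  # assumed from the name "Falcon 180B" (approximate)


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# Illustrative precisions: 16-bit (unquantized), then 8-bit and 4-bit.
for bits in (16, 8, 4):
    size = weights_gb(PARAMS, bits)
    print(
        f"{bits:>2}-bit: ~{size:.0f} GB"
        f" | under 192 GB: {size < 192}"
        f" | under 96 GB: {size < 96}"
    )
```

On paper, the 16-bit weights are far beyond either machine, while 8-bit and 4-bit versions slip under the 192GB ceiling. Fitting on paper is not the same as running comfortably, since that memory is shared with everything else on the machine.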

That opportunity, of course, starts with developers tinkering around with what’s out there, backed by what seems to be a growing support network. That support can come in the form of investment or donated compute. Or it could look like somewhat unique bets, like a recent one by Andreessen Horowitz.

Seeding the ecosystem with support for developers
