The on-device AI era is closer than it seems
One of the dream outcomes of AI development is sneaking up on us.
Since the launch of ChatGPT, AI development has seemingly veered off in two directions: pushing the envelope on powerful, hungry models like Gemini Ultra and GPT-4; and squeezing as much performance as possible out of smaller models, like Mistral’s, that don’t need OpenAI-grade hardware.
The open source community—mostly on Hugging Face—has been moving the latter forward, often quite incrementally, building on top of releases from Meta and Mistral. But that steady progress has brought us closer and closer to one of the dream outcomes of AI that for a long time felt out of reach: a locally run, highly customized app that you can take anywhere.
What has changed in the last month, though, is that a powerful model highly competitive with GPT-3.5 Turbo in its full, unquantized form is now readily available: Mixtral. And while there’s always a tradeoff in optimizing it for smaller, less powerful devices, it still demonstrates a considerable amount of progress.
I’ve spent the past few weeks messing around with quantized Mixtral models available on Hugging Face and, at this rate, a performant-enough pocket personal assistant feels much closer than you’d expect. Part of that is the improvement in models and architectures like Mixtral. But a potentially bigger part of it is just how easy it has become to get one up and running.
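To give a sense of what that tinkering looks like in practice: pulling down a quantized Mixtral build is a one-liner with the huggingface_hub library. The repo and filename below are illustrative rather than a recommendation; check the repo’s file list for the quantization variant you actually want.

```python
# A minimal sketch: downloading a quantized (GGUF) Mixtral build from Hugging Face.
# The repo_id and filename are illustrative; browse the repo for the exact files available.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF",
    filename="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # a ~4-bit quantization
)
print(model_path)  # local cache path you can point an inference engine at
```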
The complexity of running local models is dropping from every possible angle. Llama.cpp has effectively provided an engine for running them, while LangChain and other tools offer packages on top of it to wrap them into workflows. The hardware requirements to run models are dropping and their composition is changing, making it possible to run them on less-powerful devices with less memory. And on top of that, coding companions are available to stitch all of that together without much work.
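To make that concrete: the llama-cpp-python bindings around Llama.cpp can load a quantized GGUF file and generate text in a handful of lines. This is a sketch, assuming you’ve downloaded a model like the one above and have enough memory to load it.

```python
# A minimal sketch: running a quantized model locally via the llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # path from the download step above
    n_ctx=2048,       # context window size
    n_gpu_layers=-1,  # offload as many layers as possible to a GPU, if one is available
)

output = llm(
    "Q: What can I realistically run on a laptop today? A:",
    max_tokens=128,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```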
Making all of this possible is creating opportunities for startups everywhere. GGML.ai is designing the tooling to get models running locally at all. LangChain has pretty much become the default starting point for stitching all these tools together and orchestrating them. Chroma provides a lightweight vector database that you can spin up locally. And then there’s a growing cluster of wrappers and skins on top of everything to reduce the complexity even further. The list is only going to keep growing given the sheer number of niches that remain to be filled, particularly around data preparation.
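Chroma is a good example of how lightweight that layer has become. A minimal local setup, assuming the chromadb package and its default embedding function, looks something like this:

```python
# A minimal sketch: a local vector store with Chroma, persisted to disk.
# Uses chromadb's built-in default embedding function; swap in your own as needed.
import chromadb

client = chromadb.PersistentClient(path="./local_index")
collection = client.get_or_create_collection("notes")

collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "Mixtral is a sparse mixture-of-experts model from Mistral.",
        "Llama.cpp runs quantized GGUF models on consumer hardware.",
    ],
)

results = collection.query(query_texts=["what runs on a laptop?"], n_results=1)
print(results["documents"])
```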
Amid this, we still haven’t even seen what Apple and Google are expected to do here—other than that they are doing work with local models. Google’s Gemini series includes Gemini Nano, a smaller version designed to run on Android devices. Apple has thus far been relatively quiet, but recently put out research on techniques for running inference on language models with limited memory.
The same models that run on local, less-powerful devices could then easily be scaled up in the cloud. And because they’re optimized for those kinds of devices—think even something like a Raspberry Pi—the cost of scaling up something built locally, served from cheaper cloud hardware through any of these abstraction layers, is going to be considerably lower than deploying those tools out of the box on high-powered GPUs.
While APIs are always going to be useful thanks to their ease of use—and how cheap they are getting—there’s also a big opportunity for running local models that don’t chew up A100 hours. They can be a kind of fun personal assistant, but they can also potentially handle requests at a much larger scale thanks to the lower hardware requirements.
We’re effectively barreling toward a future with local models running everywhere, and it seems a lot closer than it looks on paper right now. And one project picking up a lot of buzz in particular seems to be one of the bigger indicators of how close that future actually is: Ollama.
The dropping complexity of running a local LLM
Even an entry-level MacBook Pro can have a relatively performant model—Mistral 7B—running locally through packages like Llama.cpp and Ollama. While Llama.cpp has been available since the beginning of last year, Ollama essentially removes a lot of the technical complexity of running it.
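That’s roughly the whole pitch. Once Ollama is installed and a model has been pulled (ollama pull mistral), it runs a local server with a simple HTTP API, and a request against it is about all it takes. A sketch:

```python
# A minimal sketch: querying a locally running Ollama server over its HTTP API.
# Assumes Ollama is installed and "ollama pull mistral" has already been run.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Give me a one-sentence summary of what quantization does to a model.",
        "stream": False,  # return the full response at once instead of streaming tokens
    },
)
print(resp.json()["response"])
```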
Ollama isn’t just picking up traction on GitHub and among the machine learning engineering community. It’s a project that seems to have picked up some buzz among investors, even if it carries some of the same feature-vs-product baggage that LangChain carried for a long time. But Ollama is also increasingly seen as a potential path forward for language models running on edge devices more broadly, not just on laptops.
Ollama has started to pick up a somewhat LangChain-like growth curve, if we’re going by GitHub stars. And LangChain, which already offered ways to use local models through a variety of packages powered by Llama.cpp, integrates directly with Ollama.
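In practice that integration is about as thin as it sounds. A sketch, assuming the langchain-community package is installed and an Ollama server is running locally:

```python
# A minimal sketch: LangChain's Ollama integration.
from langchain_community.llms import Ollama

llm = Ollama(model="mistral")  # talks to the local Ollama server
print(llm.invoke("Explain mixture-of-experts in two sentences."))
```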