Model shelf life and the AI ouroboros
If leaderboard turnover is any indicator, enterprises won't get "state of the art" open source models into production.
Author’s note: Now that I’m about three months into the development of Supervised, I’m going to continue my work to make this a sustainable independent journalism publication—with the hope of one day becoming a newsroom, and not just a newsletter.
Beginning August 25, two of the three weekly issues of Supervised will be available for subscribers only, while free readers will receive a short preview of each post. The "Still on my Radar" section will now only appear on Fridays, and it will be available only to paid subscribers.
Thank you to all of my readers, and please continue to send feedback and suggestions (and tips)! You can reach me via the email address or Signal number at the bottom of every post.
In addition, because of some travel next week, Tuesday's issue will move to Thursday. There also won't be an issue on Wednesday thanks to a semi-overloaded schedule.
Turnover in open source “state of the art”
The performance of open source models is increasing at a rapid pace now that Llama 2 has hit commercial-ish availability. You can see it just from the sheer level of turnover on the Hugging Face Open LLM Leaderboard.
That speed of development creates an interesting challenge for companies exploring open source model deployments. There are plenty of reasons to choose open source models over APIs, including, more recently, concerns around the reliability and uptime of GPT-4. But the turnover in what's considered "state of the art," at least by existing benchmarks and the Open LLM Leaderboard, is incredibly high.
For larger enterprises with longer adoption cycles, by the time a model actually reaches production (either internally or within a product), it's probably already been beaten on the Hugging Face Open LLM Leaderboard by another open source model. Or, more likely, it's been beaten several times over.
Since the launch of Llama 2, we've seen leaderboard toppers with a "shelf life" of less than a week, going by the initial commit dates of the models evaluated on the Hugging Face Open LLM Leaderboard and their evaluation scores. And the top models have all been within less than a point of each other upon release.
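To make that "shelf life" math concrete, here's a minimal sketch of the calculation: given a list of successive leaderboard toppers, each with its initial commit date and leaderboard score, it measures how many days each one held the top spot and by what margin it was beaten. The model names, dates, and scores below are placeholders for illustration, not real leaderboard data; an actual analysis would pull commit dates from each model's Hub repo and scores from the leaderboard's evaluation results.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Leader:
    model: str        # repo id of the model that took the top spot
    first_seen: date  # initial commit date of the model on the Hub
    score: float      # average benchmark score from the leaderboard

# Hypothetical entries for illustration only.
leaders = [
    Leader("org-a/model-x", date(2023, 7, 20), 66.8),
    Leader("org-b/model-y", date(2023, 7, 24), 67.3),
    Leader("org-c/model-z", date(2023, 7, 28), 67.9),
]

# Shelf life of each topper: days until the next model dethroned it,
# plus the score margin it was beaten by.
for current, successor in zip(leaders, leaders[1:]):
    shelf_life = (successor.first_seen - current.first_seen).days
    margin = successor.score - current.score
    print(f"{current.model}: on top for {shelf_life} days, "
          f"beaten by {margin:.1f} points")
```

Run against real leaderboard history, this is the kind of tally that produces sub-week shelf lives and sub-point margins.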