What happens to Fivetran?
Let's examine where a structured data startup sits in an unstructured data world. Plus: a new open source framework for AI development.
We’ll be covering two topics today: a new framework in AI goes open source, and then why everyone is talking about Fivetran again.
Since it’s a holiday weekend we’ll be publishing Friday’s issue on Tuesday.
A new crack at displacing Cuda goes open source
One of my favorite questions to ask developers is why the AI developer ecosystem seems to naturally lend itself more toward open source.
The answers tend to vary a lot, but they typically settle on the academic culture around AI. That culture has given birth to a lot of technologies, including two of the most consequential in deep learning: TensorFlow and, perhaps more importantly, PyTorch.
And indeed there are a lot of open source projects beyond just the open-ish models available like Llama 2. You can look at tools like LangChain and LlamaIndex, or vector databases like Qdrant and LanceDB. All of them are quickly gaining traction in one way or another, and today we’re getting another entry with the open-sourcing of Mojo’s core modules.
Mojo offers a tantalizing proposition: a new developer framework that effectively abstracts out AI development so you can avoid dealing with Cuda altogether. It’s appealing for a variety of reasons, if only to loosen the grip Nvidia holds through PyTorch’s dependency on Cuda (setting aside PyTorch/XLA) amid the ongoing shortage of its most performant hardware.
Modular, the creators of the Mojo developer framework, essentially want to provide a bridge there. And today Modular said it would be placing Mojo’s core modules under the Apache 2.0 open source license (with LLVM exceptions). Mojo comes from co-founders Tim Davis and Chris Lattner, both of whom worked on Google’s TensorFlow.
“The foundation of the whole industry, going back to AlexNet, has been open,” Tim Davis, co-founder and president of Modular, told me. “Irrespective of if you have your opinions about large tech companies, the fundamental underpinning of the industry mostly has been open. We can talk about the future and how it might change, but in the past, things like the transformers architecture, PyTorch, JAX, TensorFlow, all the frameworks and underpinning infrastructure has fundamentally been quite open. That’s enabled the industry to flourish.”
Modular has raised $130 million from investors including GV, General Catalyst, SV Angel, and Greylock. Its latest round came in August last year, when it raised $100 million (which The Information pegged at a potential $600 million valuation). And while Mojo offers a really appealing opportunity, a lot of the interest in the company also comes from the people building it.
The industry’s academic roots have also been a small source of tension as it’s grown, particularly as companies look to pluck the best researchers out of academia and into industry. Researchers in academic institutions typically haven’t had to deal with building production-grade tooling, and can instead fight over who beats the benchmark flavor of the month.
“People in the research world have a raw incentive to move quickly, publishing research and getting it out into the world with pythonic code as fast as they can,” Davis told me. “There is a big difference between that as an intrinsic motivator, compared to a group of highly performant production engineers who are shipping prod-level code to a group of customers that’s expected to be highly reliable, performant, and execute 24/7 at extremely low latency.”
It’s part of why Modular is going after the classic multi-language problem in AI development as these tools have to scale and become more performant: Python spaghetti code has to funnel down to some slightly more organized C++ and some algorithmic gymnastics in Cuda. (This is often why people point to similarities between Mojo and Julia, though the two aren’t exactly comparable.)
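To make that hand-off concrete, here’s a minimal, hypothetical Python sketch (the softmax example and function names are mine, not Modular’s): the readable pure-Python version researchers write first, and the version that quietly delegates the hot loop to compiled C via NumPy, with a hand-written Cuda kernel being the next rung down that same ladder. Mojo’s pitch is that those layers can live in one language.

```python
# A sketch of the multi-language hand-off, not anyone's production code.
import math
import numpy as np

def softmax_python(scores):
    # Naive Python: readable, but every exp and add runs in the interpreter.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_numpy(scores):
    # Same math, but the loops now execute inside NumPy's compiled C backend.
    exps = np.exp(np.asarray(scores, dtype=np.float64))
    return exps / exps.sum()

if __name__ == "__main__":
    print(softmax_python([1.0, 2.0, 3.0]))
    print(softmax_numpy([1.0, 2.0, 3.0]))
```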
Modular now has the benefit of letting developers run off with it and build out extensions and tooling that it can commercialize, such as through its serving platform MAX. The Mojo repository has a little over 17,000 stars, which obviously isn’t LangChain or LlamaIndex levels of traction (though both of those have been around for longer).
Still, getting developers to learn another new framework when PyTorch has been a core part of workflows for so long is a tall order. You can replicate the touch and feel of Python, but it’s still a bunch of new lines of code to write. If anything, though, the frenzy around AI has shown that developers and companies are more than willing to start from scratch with something new to move ideas into production (especially if it’s in Rust).
“There’s a huge pool of Python programmers,” Davis told me. “We want to see how many can we bring down to write high-performance intrinsics. Today that’s a talent problem, there’s not many people that can do that.”
Revisiting the modern data stack with Fivetran
Before ChatGPT came out a little more than a year ago, we had a very lively discourse happening in big data and analytics. Specifically, it was about how the modern data stack had gotten completely out of control and sprawled into this very precariously duct-taped-together pipeline that somehow, against all odds, got us an accurate(ish?) number for a presentation.
Well, we’re now about a year past the launch of ChatGPT, and a whole lot of AI models have become extremely within-one-point-of-each-other-on-benchmark-X. That mid-range model segment has turned into a widely expected race to the bottom on price (setting aside Claude Haiku, which we’ll discuss another time). And, in general, the whole modern AI stack is kind of “baking” right now as companies start figuring out how things look in production.
So let’s go back to arguing about the modern data stack already!
Among a lot of investors and sources lately, that means revisiting a long-standing pillar of data pipelines: ingestion tools (ETL), and more specifically, Fivetran. And it’s really a simple question: what happens to Fivetran next, given the growing need for unstructured data management in AI workflows?
Experts and investors I talk to have for years expected a kind of grand consolidation of the modern data stack. Part of that was an investor frenzy around developing new data pipeline tools to piggyback on the rise of data abstraction layers like Snowflake and Databricks. We were already talking about a bubble back then, and we had near-zero interest rates to go with it.
The emergence of production language model use cases delayed (or maybe partly spoiled) that expected consolidation, largely because a lot of these startups built around providing governance, quality assurance, and observability became a whole lot more interesting as potential pathways to language models within corporate workflows.
Fivetran, too, might end up being an interesting beneficiary of the emergence of AI, thanks to one of its biggest emerging paradigms: the intersection of retrieval augmented generation (RAG) and governance.
“Most companies probably won’t be doing fine-tuning, or even building their own foundation model,” Fivetran CEO George Fraser told me. “RAG is the right answer for these more corporate applications. You can implement permissions at the retrieval step, and you can provide links so people can verify.”
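As a rough sketch of the pattern Fraser describes, here’s some hypothetical Python (the document store, permission fields, and function names are illustrative, not Fivetran’s or any vendor’s actual API): documents get filtered against the requesting user’s permissions before they ever reach the model, and each retrieved snippet keeps its source link so the answer can be verified.

```python
# Hypothetical sketch of permission-aware retrieval for RAG; all names are made up.
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    source_url: str
    allowed_groups: set  # which user groups may see this document

def retrieve(query: str, docs: list, user_groups: set, k: int = 3) -> list:
    # Enforce permissions *before* ranking, so restricted content never reaches the model.
    visible = [d for d in docs if d.allowed_groups & user_groups]
    # Toy relevance score: how many query terms appear in the document text.
    terms = query.lower().split()
    ranked = sorted(visible,
                    key=lambda d: sum(t in d.text.lower() for t in terms),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, retrieved: list) -> str:
    # Each snippet carries its link so the generated answer can cite verifiable sources.
    context = "\n".join(f"- {d.text} (source: {d.source_url})" for d in retrieved)
    return f"Answer using only the context below and cite the sources.\n{context}\n\nQuestion: {query}"
```

The design choice doing the work here is that the permission check happens at the retrieval step rather than in the prompt, which is exactly where governance-minded vendors argue it belongs.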