A classic data problem is taking hold as AI gets ready for production

Data lineage is one of the new hot topics amongst developers trying to get AI tools out the door. Plus: GPT-4 is still a coding tutor, and don't just take it from me this time.

Dec 14, 2023

a woman in a lab coat holding a piece of string, she follows the string down a hallway, the floor is covered with papers and computers, --ar 4:3

Author’s note: some open source stuff happened in the last week, which we’ll get into in the next issue—I’m waiting for more endpoints to be available and pricing for them to come out to dig into this a bit more.

While we haven’t quite gotten to broader adoption of generative AI tooling, when it does happen every company is generally dreading the same thing: some model committing some kind of gaffe that they have to scramble to manage.

And given that AI models are kind of unpredictable, a lot of energy is flowing into building walls and guardrails around them to prevent that from happening. But increasingly a new (old) paradigm comes up as a way to both try to get ahead of said gaffe and figure out how it happened in the first place: lineage.

Yes, here we go again! Increasingly we are seeing the same challenges that plague data engineering and management starting to weigh on the AI development process. And that’s largely for a good reason, as AI is still at the very front end of a transition from an experimental-slash-hype phase into a production-ready phase. (And we are actually starting to see signs of that as most startups and companies I talk to are seeing consumption tick up.)

But to get there, AI development has to grow up. And a big part of that is managing the data that flows to and from those models, whether it is what goes into training and fine tuning or what’s retrieved for prompts. And that includes an increased focus on data governance—or, more specifically as of late, data lineage.

While governance is a kind of catch-all term, lineage is more precise and simple: what data is moving where, and for what purpose. That involves tracing calls through pipelines made by a variety of different applications, such as some tool summoning a data set from Salesforce to pass it into Snowflake.
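To make "what data is moving where, and for what purpose" a bit more concrete, here is a minimal sketch of lineage as a list of recorded hops. Everything in it is hypothetical: the `LineageEdge` record, the `upstream` helper, and the dataset and tool names are made up for illustration and don't reflect how Salesforce, Snowflake, or any lineage product actually models this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """One recorded hop of data (a hypothetical record shape)."""
    source: str       # where the data came from
    destination: str  # where it was moved to
    tool: str         # application that made the call
    purpose: str      # why the data was moved


def upstream(edges, dataset):
    """Return every dataset that feeds into `dataset`, directly or indirectly."""
    found = []
    stack = [dataset]
    seen = {dataset}
    while stack:
        current = stack.pop()
        for edge in edges:
            if edge.destination == current and edge.source not in seen:
                seen.add(edge.source)
                found.append(edge.source)
                stack.append(edge.source)
    return found


# Illustrative hops only: a sync from Salesforce into Snowflake,
# a transform, and a retrieval step that feeds a prompt.
edges = [
    LineageEdge("salesforce.accounts", "snowflake.raw_accounts", "sync_tool", "CRM reporting"),
    LineageEdge("snowflake.raw_accounts", "snowflake.account_summary", "transform_job", "dashboard"),
    LineageEdge("snowflake.account_summary", "prompt_context", "retrieval_service", "LLM prompt"),
]

print(upstream(edges, "prompt_context"))
```

Walking the hops backward from `prompt_context` is the retrospective question (what fed this output?); the same records, checked before a pipeline runs, cover the proactive one.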

A lot of the value of data lineage is retrospective: my app broke and I need to know what happened. But the emergence of broader, stricter regulatory frameworks (like GDPR and the CCPA) made proactive lineage much more important to a much broader audience. That led to lineage essentially getting a preferred slot in the modern data stack in the orbit around Snowflake or Databricks, though it was typically thrown under governance umbrellas.

While AI models aren’t necessarily concerned with data pipelines breaking the user experience (though you could include fetching information for a prompt through retrieval), the principle is pretty similar and in this case a much more proactive one. Higher-quality data leads to better outcomes, but there are obviously lots of questions over what data can and can’t be used in the context of a language or diffusion model.

“Lineage was already a very hot topic, and you could make the same case for model lineage now, where you have to really care about the inputs and outputs for different models instead of the typical problem of if your data is missing,” Julian LaNeve, CTO of Astronomer, which is effectively the steward for the orchestration tool Airflow, told me. “You care about prediction, you care about accuracy and loss scores. I think you can, and probably should, draw a lot of parallels in the same way Airflow and Astronomer is evolving more data.”

The more things change, the more things stay the same, right?

But once again we’re seeing an increasingly common trend of data engineering and management principles extending into AI, though not always in the exact same form factor. And some startups already working with the problem are finding themselves explaining it to a whole new audience.

How data lineage became top of mind in AI

Classic lineage, going all the way back to the Informatica days, wasn’t just about ensuring the validity of data in reports that would end up in board meetings. Industries with stricter regulatory requirements or more stringent customer privacy needs have had to keep very clear documentation about what data was being used, where, and how, and how companies moved it in the first place.

“Data lineage has always been a priority for use cases that need clear explainability for decisions or figures reported,” Christian Kleinerman, SVP of product at Snowflake, told me. “Regulatory reporting or implementing compliance with privacy regulations are examples that have traditionally relied on knowing the lineage of a data set. Generative AI is broadening that set of use cases where lineage is important.”
