How AI could change the work of data engineers
A conversation with Snowflake’s Torsten Grabs on the growing need for people to manage the pipes that feed AI models.
Author’s note: I’m still recovering from a rough injury, so I’m working at mixed capacity. I am still targeting two paid issues next week. Apologies for any typos and other assorted mix-ups in this post!
Update 10/19: Due to (still) working at mixed capacity, both paid issues for the week of 10/13 will be combined into a single one that will go out on Friday.
There are still no models competitive with OpenAI’s GPT-4. The company has effectively compressed the entirety of the available internet into a format that lets anyone get a semi-reliable answer to a question.
But as companies start to figure out how to actually use language models internally, beyond simple code generation tools, the real value will come from proprietary data. That’s part of the appeal of language models like Meta’s Llama series: companies can keep them behind a virtual private cloud and customize them with proprietary data, like a complete history of SQL queries.
An important part of that, though, is determining what data to feed into the process of adjusting and customizing those models. Pushing everything indiscriminately into a fine-tuning run doesn’t necessarily produce the best results. Some work, like Microsoft’s “textbooks are all you need” research, has shown that smaller but more targeted data sets can provide outsized value for specific business use cases compared to cramming everything into a model.
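To make that concrete, here’s a minimal sketch of what that kind of curation might look like for the SQL-query-history example above. The log field names and filtering thresholds here are assumptions for illustration, not a real warehouse schema:

```python
import json

# Illustrative sketch of the "smaller but targeted" idea: curating a
# fine-tuning set from a company's SQL query history instead of dumping
# everything in. The log field names ("query_text", "error", "runtime_ms")
# are assumptions, not a real Snowflake schema.

def curate_examples(query_log, max_examples=5000):
    """Keep queries that ran successfully and quickly, drop duplicates,
    and cap the total size of the data set."""
    seen = set()
    curated = []
    for record in query_log:
        text = record["query_text"].strip()
        if record.get("error"):            # skip queries that failed
            continue
        if record["runtime_ms"] > 60_000:  # skip pathologically slow queries
            continue
        if text in seen:                   # skip exact duplicates
            continue
        seen.add(text)
        curated.append({"text": text})
        if len(curated) >= max_examples:
            break
    return curated

def write_jsonl(examples, path="finetune.jsonl"):
    """Write one JSON object per line, a common fine-tuning input format."""
    with open(path, "w") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")
```

The specific filters matter less than the principle: a deduplicated set of queries that actually succeeded says more about how a company uses its data than the raw log ever would.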
But to get that working in the first place, a company has to have a data engineering function in place. The role isn’t new: data engineers were already responsible for the pipelines that got data into classic supervised learning models and Snowflake tables.
And more eyes are on the role, and on the tools that power it in an LLM-enabled future (such as Unstructured.io), as companies tackle increasingly complex problems around data engineering and data quality.
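For context on where a tool like Unstructured.io fits in that pipeline: its open-source `unstructured` package turns messy files like PDFs and HTML into text elements an LLM workflow can consume. A minimal sketch, assuming the package is installed and with a placeholder file path:

```python
from unstructured.partition.auto import partition

# partition() detects the file type and extracts a list of text elements
# (titles, paragraphs, tables, and so on) from the document.
elements = partition(filename="quarterly_report.pdf")  # placeholder file

# Keep the non-empty text for downstream chunking, embedding, or fine-tuning.
chunks = [el.text for el in elements if el.text and el.text.strip()]
print(f"Extracted {len(chunks)} text elements")
```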
One company with pretty direct visibility into some of this is Snowflake, which hosts the kind of data that thousands of companies will want to use to build these customized models, along with the ability to create a front-end interface for them with Streamlit. SQL generation is one example, but many companies are interested in a broader natural language interface for their data.
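To make the “natural language interface” idea concrete, here’s a minimal sketch of such a front end in Streamlit. The `generate_sql` and `run_query` functions are hypothetical placeholders for a model call and a warehouse connection; this is not Snowflake’s or anyone else’s actual implementation:

```python
import streamlit as st

SCHEMA = "orders(order_id, customer_id, total_usd, created_at)"  # illustrative schema

def generate_sql(question: str, schema: str) -> str:
    """Placeholder: call a language model with the schema and the question,
    asking it to return a single SELECT statement."""
    raise NotImplementedError("wire up your model of choice here")

def run_query(sql: str):
    """Placeholder: execute the SQL against the warehouse and return rows."""
    raise NotImplementedError("wire up your warehouse connection here")

st.title("Ask your data")
question = st.text_input("Ask a question about the orders table")

if question:
    sql = generate_sql(question, SCHEMA)
    st.code(sql, language="sql")   # show the generated SQL for sanity-checking
    st.dataframe(run_query(sql))   # display the results as a table
```

Surfacing the generated SQL before running it is one small way to keep a human in the loop while the models remain only semi-reliable.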
To learn a little more about some of the responsibilities and possible changes coming to the data engineering field, I spoke with Torsten Grabs, senior director of product management at Snowflake.
Here’s a lightly edited version of the full interview, where we talk about data quality, data engineering, and how language models make it into the workplace:
OpenAI and some of the other foundation model companies have basically indiscriminately crawled the internet for data for their products. But there seems to be an increasing focus on narrowing that scope of data collection and focusing on higher quality data. What are you hearing from partners and customers about how important data quality has become given the emergence of language models?