From Data to Model: The Role of Data Engineers in the AI Lifecycle

Introduction

Today I'll be talking about something that doesn't get nearly enough attention in the world of AI, and that is data engineers. Boring, I know.

You see, AI is the shiny new toy that everybody wants to play with, but behind the scenes there is a lot going on and not everybody realizes it.

In fact, we've run into several real-life examples where businesses wanted to implement AI but were not, for example, data ready. So I will go over what that means and what AI basically requires from the data perspective.

Speaker Background

Let me first introduce myself. So my name is Samo Kožuch.

I've been a data engineer for seven years. I've worked in mid-sized companies and corporations, and I'm currently working in a very small company of 10 people.

So I've seen data stacks ranging from very evolved to basically being built from scratch. And I've built them on basically every cloud, so AWS, Azure, GCP, and NSS.

I currently work at a company called Vertex, which is a Google Cloud partner. That's why you're going to see a lot of logos in my slides.

So yeah.

Initial Concerns About AI

So let me start with saying this. When I first heard the buzz about AI, I had a minor existential crisis as a data engineer.

You know, it was like, well, it was good while it lasted. Now I'll basically just hand over my job to ChatGPT, and it will write pipelines for me. And I'm going to do chess classes or whatever.

I mean, there was so much hype about AI, that it's going to replace us all in basically three to five years, software engineers and data engineers alike, that who wouldn't have been concerned, really? Even my washing machine now has AI, probably a bunch of if statements really, but you cannot even do your laundry in peace.

But then it hit me. AI doesn't run on magic. It runs on data. It actually requires a lot of data, clean data, structured data.

And that's when I realized that, hold on, if anything, AI is going to make my job harder, really. Basically, this was my reaction. First I was in panic mode because AI was here to take my job, then I calmed down because I realized I'm a data engineer, so I should be safe. And then I realized that yes, I'm a data engineer, I'm going to be managing the data, so I was in panic mode again.

So yeah, AI is probably going to make data engineers much busier.

AI Implementation in Companies

So as we saw in the poll that Melinda ran, many companies are already implementing AI. Some companies are still in the planning stage. Some companies haven't started yet, but will do so.

Here's how it usually goes down in companies, or at least what I've run into. It's usually some C-level executive who hears the word AI, and suddenly they start throwing around buzzwords like machine learning, neural networks, synergy. And they think, we need to hire data scientists right now, and by the end of next week we'll have a state-of-the-art, you know, AI model predicting stock prices or customer churn or, I don't know, maybe better data.

So they hire the poor data scientist. They drop them into the company. And guess what? There is no data pipeline. There is no data warehouse. There is not even clean data, just a mountain of spreadsheets, PDFs, and data sources that basically make no sense.

And that's when the real fun begins, or fun in air quotes really, for the data scientists, because they are supposed to be building cutting-edge models and instead they have to become data engineers, basically. It's like you've hired a chef to cook a gourmet meal, but instead you give them a mop and a bucket and tell them to unclog the kitchen sink. So they're forced to clean the data, build the pipelines, and organize the chaos. And that's wrong.

The Role of Data Engineers

That's where we, the data engineers, should step in. Because honestly, we are the guys who live for this stuff.

And I saw a great metaphor on LinkedIn a couple of days ago about comparing data engineering using Marie Kondo metaphors. I hope at least some of you are familiar with Marie Kondo.

So basically, if the data pipelines are the homes, we're basically the Marie Kondos of the operation.

So our job is to look at the pile of chaotic unstructured data and ask, does this data spark joy? Or is this data some old crappy CSV file that should have been thrown away years ago?

So we tidy up those messy pipelines, we get rid of the junk, and we organize everything so that it's easy to find and actually useful. Think of it as having a super neatly tidied-up closet. Every data set is basically folded, color coded, labeled, and ready for you to consume, either for your machine learning model or for your presentation when you try to impress the CEO.

And here's the thing.

Collaboration Between Data Engineers and Data Scientists

I think the whole success of AI, it's not a solo effort. It's a team sport between the data engineers and data scientists.

We're basically like the dynamic duo, the Batman and Robin of AI, you can call it. Because sure, the data scientists, they build the models, but without clean, organized, and structured data, those models would be about as useful as trying to run a marathon in flip-flops, really.

So collaboration between data engineers and data scientists is not just important, it's essential.

I borrowed this image from Satish Gupta, who actually runs a great website, ML4Devs. So if you're trying to pick up machine learning as a developer, you can check it out.

But basically, I wanted to visualize what the collaboration between data engineers and data scientists looks like in the MLOps lifecycle. And it's the framework that makes this collaboration work smoothly.

The MLOps Lifecycle

Data engineers, we are the ones who fit into the critical early stages of this whole life cycle. So basically we handle data acquisition, the curation, the transformation, and very importantly, the validation and quality of the data.

So let's go over what those mean. Data collection, for us, is basically trying to gather as much data as possible, you know, going to every spreadsheet, every database, every API the company has and gathering the data.

It's like putting it all in one place, piecing the puzzle together and, you know, hunting for treasure, except the treasure is the CSV files. Then once we've got the data, we do the Marie Kondo stuff with spark joy.

So basically we curate it because not all data is good data. Some data is, for example, irrelevant. Some data is messy. So we try to organize it.

If it doesn't bring any value to us, we basically throw it out and it doesn't end up in the next pipeline.
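To make the collection and curation steps a bit more concrete, here is a minimal sketch, not a real pipeline. It assumes two made-up sources (a spreadsheet export as CSV text and a hypothetical customers table in an in-memory SQLite database), and the rule for what counts as "bringing value" (a record needs both a customer_id and an email) is purely an assumption for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet export; in practice this would be a file or an API response.
SPREADSHEET_EXPORT = """customer_id,email,notes
1,ana@example.com,
2,bob@example.com,old test row
"""

def collect_from_csv(text, source):
    """Read CSV text into dict records, tagging each with where it came from."""
    return [dict(row, source=source) for row in csv.DictReader(io.StringIO(text))]

def collect_from_db(conn, source):
    """Read rows from the (hypothetical) customers table into dict records."""
    cursor = conn.execute("SELECT customer_id, email FROM customers")
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row), source=source) for row in cursor]

def curate(records, required=("customer_id", "email")):
    """Keep only records with a non-empty value in every required field."""
    return [
        r for r in records
        if all(r.get(field) not in (None, "") for field in required)
    ]

# Build an in-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(3, "cy@example.com"), (4, None)],
)

# Collection: everything lands in one place, whatever the source.
collected = collect_from_csv(SPREADSHEET_EXPORT, "spreadsheet") + collect_from_db(conn, "crm_db")
# Curation: records without an email never reach the next pipeline.
curated = curate(collected)
print(len(collected), "collected,", len(curated), "kept")
```

The source tag on every record is a small design choice that pays off later: when something looks wrong downstream, you can tell which spreadsheet or database it came from.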

Next up, we transform the data. So that is when we clean, structure, and prepare the data.

Basically, you cannot put raw ingredients into a pot and expect a meal to come out. You need to chop the ingredients and then sear them, et cetera.
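As a rough sketch of what that chopping can look like in code, the example below cleans a few made-up rows. The field names and the two accepted date formats are assumptions for illustration; a real pipeline would agree on those with whoever owns the data.

```python
from datetime import datetime

# Hypothetical raw rows with inconsistent casing, whitespace, and date formats.
raw_rows = [
    {"email": " Ana@Example.com ", "signup": "2024-01-05"},
    {"email": "ana@example.com", "signup": "05/01/2024"},
    {"email": "bob@example.com", "signup": "2024-02-10"},
]

# Assumed input formats: ISO dates and day/month/year.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def normalize_date(value):
    """Return the date as ISO 8601 text, trying each assumed format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

def transform(rows):
    """Normalize each row, then drop rows that repeat an already seen one."""
    seen = set()
    result = []
    for row in rows:
        clean = {
            "email": row["email"].strip().lower(),
            "signup": normalize_date(row["signup"]),
        }
        key = (clean["email"], clean["signup"])
        if key not in seen:
            seen.add(key)
            result.append(clean)
    return result

clean_rows = transform(raw_rows)
print(clean_rows)
```

Note the order: the first two rows only turn out to be duplicates after normalization, so deduplicating the raw rows first would have missed them.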

This really is about getting data into shape, removing duplicates, normalizing formats, making sure that everything is aligned. And the last piece of the puzzle is the data quality.

So this one is really important, and it has risen in importance in the era of AI. It has become one of the most critical responsibilities for data engineers, because AI models are only as good as the data that they are fed.

Basically, if you feed junk data to a model, you'll probably end up with a model that gives junk results. So we need to ensure that the data is accurate, relevant, and consistent.
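A minimal sketch of such a quality gate could look like the following. The specific rules (required fields, a plausible age range, unique customer IDs) and the field names are assumptions picked for illustration, not a standard checklist.

```python
def validate(rows, required=("customer_id", "age"), age_range=(0, 120)):
    """Return a list of human-readable issues; an empty list means the batch passed."""
    issues = []
    seen_ids = set()
    for index, row in enumerate(rows):
        # Completeness: every required field must be present.
        for field in required:
            if row.get(field) is None:
                issues.append(f"row {index}: missing {field}")
        # Accuracy: values must fall inside a plausible range.
        age = row.get("age")
        if age is not None and not (age_range[0] <= age <= age_range[1]):
            issues.append(f"row {index}: age {age} out of range")
        # Consistency: an ID should identify exactly one row.
        customer_id = row.get("customer_id")
        if customer_id is not None:
            if customer_id in seen_ids:
                issues.append(f"row {index}: duplicate customer_id {customer_id}")
            seen_ids.add(customer_id)
    return issues

# Hypothetical batch with three problems planted in it.
batch = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 340},
    {"customer_id": 2, "age": None},
]
problems = validate(batch)
for problem in problems:
    print(problem)
```

Returning the full list of issues, rather than stopping at the first one, makes it easier to decide whether a batch should be fixed, quarantined, or rejected before it reaches a model.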

And this is also where AI differs from classical analytics use cases. There, you know, you can have some slip-ups here and there, for example duplicate rows and stuff like that, and nobody will notice.

But if you feed those mistakes into AI, it will basically amplify them, and the model you end up with will be bad.
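Here is a toy illustration of how duplicate rows can distort what a model sees. The rows are made up, and the "faulty pipeline run" that loaded one customer three times is an assumed scenario.

```python
# Hypothetical labeled rows: (customer_id, churned). Customer 3 was loaded
# three times by a faulty pipeline run.
rows = [(1, False), (2, False), (3, True), (3, True), (3, True), (4, False)]

def churn_rate(labeled_rows):
    """Share of rows labeled as churned."""
    return sum(1 for _, churned in labeled_rows if churned) / len(labeled_rows)

# dict.fromkeys keeps the first occurrence of each row and preserves order.
deduplicated = list(dict.fromkeys(rows))

rate_with_duplicates = churn_rate(rows)       # 3 of 6 rows are churned
rate_deduplicated = churn_rate(deduplicated)  # 1 of 4 rows is churned
print(rate_with_duplicates, rate_deduplicated)
```

In a dashboard, a few extra rows might shift a total slightly. A model that learns from these rows would see churn as twice as common as it really is in this sample, and it would weight that one customer's pattern three times.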

Over the past two years, the classical definition of a data engineer has really changed. I mean, AI has changed the game, and now data engineers are expected to do a lot more. And we're basically required to stay ahead as well.

So it isn't just about moving data from point A to point B and doing validation, collection, and stuff like that. We are required to upskill beyond our basic knowledge and move up the MLOps lifecycle.

What it means, or actually, sorry, let me start with this great quote by Joe Reis, who is a co-author of the O'Reilly book Fundamentals of Data Engineering. He basically said that people in different disciplines are starting to learn each other's craft. He's starting to see software engineers and analysts learning machine learning and AI, and the other way around, of course. And this makes perfect sense, really.

Data scientists aren't just living in their Jupyter notebooks anymore. They're learning to write production-grade code, so they can work with software engineers to implement and integrate the ML models.

And it's becoming the same for data engineers. We're also required to start learning some ML skills to understand how the data is being processed, how it enters the machine learning model, et cetera. So here are some of the skills that I've noticed we as data engineers are required to have in the age of AI.

Skills Required for the Future

The first one is working with multimodal data. So prior to the age of AI, we mainly worked with tabular data, so CSVs, SQL tables, et cetera.

Now we are storing images, videos, audio files, cat videos, memes, and stuff like that. And you cannot store those in a CSV file. We had to shift from building data warehouses and data lakes to building so-called data lakehouses, which are kind of a combination of the best of both worlds.
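As a toy analogy only, and not how any particular lakehouse product works, the idea can be sketched like this: the files themselves stay in object storage, while a table catalogs them so you can still ask questions in SQL. The bucket name, URIs, and sizes below are made up, and an in-memory SQLite table stands in for the catalog.

```python
import sqlite3

# Hypothetical object-store URIs; the files themselves are never loaded here.
assets = [
    ("s3://example-bucket/img/cat_001.jpg", "image", 204800),
    ("s3://example-bucket/audio/call_17.wav", "audio", 5242880),
    ("s3://example-bucket/video/demo.mp4", "video", 73400320),
    ("s3://example-bucket/img/meme_042.png", "image", 102400),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assets (uri TEXT PRIMARY KEY, media_type TEXT, size_bytes INTEGER)"
)
conn.executemany("INSERT INTO assets VALUES (?, ?, ?)", assets)

# Plain SQL over the catalog, even though the underlying data is not tabular.
summary = conn.execute(
    "SELECT media_type, COUNT(*), SUM(size_bytes) FROM assets "
    "GROUP BY media_type ORDER BY media_type"
).fetchall()
for media_type, count, total_bytes in summary:
    print(media_type, count, total_bytes)
```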

We have support for multimodal data, and the great thing is that we're still in the same sandbox and we still have SQL and stuff like that. Then there are the new storage systems that we have to get accustomed to. Previously, we used to store all data in S3 or Google Cloud Storage.

Right now, because of the amount of data and the different types of data, we are required to learn systems like Iceberg, for example. It's not just about the data itself. Data engineers are now required to also start understanding the basic ML concepts.

I'm not saying that we have to understand machine learning well enough to build the next ChatGPT, but we have to know what training means, what feature engineering is, for example, the difference between supervised and unsupervised learning, and lately what RAG is. And that's become part of our toolbox now as well, because we need to understand how to prepare the data and how the data will impact the model performance.
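To give a feel for two of those concepts, here is a small sketch of feature engineering on made-up customer records. The field names, the chosen features, and the fixed reference day are all assumptions for illustration.

```python
from datetime import date

# Hypothetical raw customer records.
customers = [
    {"signup": date(2023, 1, 10), "orders": 12, "plan": "pro", "churned": False},
    {"signup": date(2024, 3, 2), "orders": 1, "plan": "free", "churned": True},
]

REFERENCE_DAY = date(2024, 6, 1)  # fixed so the example is reproducible

def to_features(record):
    """Feature engineering: turn one raw record into a numeric vector."""
    tenure_days = (REFERENCE_DAY - record["signup"]).days
    return [
        tenure_days,
        record["orders"],
        1 if record["plan"] == "pro" else 0,  # numeric flag for the plan
    ]

# Inputs that any model, supervised or unsupervised, would consume.
X = [to_features(c) for c in customers]
# Labels: only supervised learning uses these.
y = [int(c["churned"]) for c in customers]
print(X, y)
```

Supervised learning would train on X together with y to predict churn. Unsupervised learning would get only X and look for structure, for example groups of similar customers, without ever seeing the churned column.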

The next thing we have to start understanding is the ML and AI pipelines. So in data engineering, usually we have the pipelines which are linear. That's really not the case with the machine learning pipelines.

They're convoluted. They have the retraining, the feedback loop, and stuff like that. So that's something that you have to at least get familiarized with.

You don't have to build those pipelines, but you have to get to know them and understand them at least a bit.
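To show what that loop looks like compared with a linear A-to-B pipeline, here is a deliberately tiny sketch. The "model" is just a threshold on a single score, and the 0.9 accuracy cutoff and the data are assumptions; real ML pipelines are far more involved.

```python
def train(examples):
    """Toy 'model': a threshold halfway between the mean score of each class."""
    positives = [score for score, label in examples if label]
    negatives = [score for score, label in examples if not label]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2

def accuracy(threshold, examples):
    """Share of examples where 'score >= threshold' matches the label."""
    return sum((score >= threshold) == label for score, label in examples) / len(examples)

def run_pipeline(initial_data, incoming_batches, min_accuracy=0.9):
    """Serve a model, monitor it on fresh labeled data, retrain when it degrades."""
    training_data = list(initial_data)
    model = train(training_data)
    retrain_count = 0
    for batch in incoming_batches:
        if accuracy(model, batch) < min_accuracy:
            # Feedback loop: fresh labeled data flows back into training.
            training_data.extend(batch)
            model = train(training_data)
            retrain_count += 1
    return model, retrain_count

initial = [(1.0, False), (2.0, False), (8.0, True), (9.0, True)]
batches = [
    [(1.5, False), (8.5, True)],                  # looks like the training data
    [(6.0, False), (7.0, False), (12.0, True)],   # the data has drifted upward
]
model, retrain_count = run_pipeline(initial, batches)
print(model, retrain_count)
```

The part that matters for data engineers is the arrow going backwards: data produced after deployment has to be collected, labeled, and validated before it flows into retraining, so the quality checks apply on every pass through the loop, not just once.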

Conclusion

So to wrap it up, did AI take my job? Not exactly. Instead, I think it gave me a whole new set of headaches.

But really, it's forced the whole data engineering sphere into new levels. And to be honest, I'm kind of enjoying it.

AI is really making unpredictable waves in the data engineering community as well. And I'm really here for it, to be honest.

In the end, AI doesn't work without data engineers, but it also doesn't work without data scientists. So I think the team collaboration is what matters, and the two of us are the unsung heroes of the whole process.

And if you are interested in learning more, there is actually a great conference happening tomorrow about data engineering for AI and ML. It's free and it's online. So if you're interested, you can scan the QR code and register and attend it if you're curious.

And with that said, I would like to thank you for paying attention. If you have any questions, I'm here to answer them. Thank you.
