Apache Iceberg in Modern Data Lakes: From Fundamentals to Scalable Streaming

Introduction

Hey everybody, my name is Uday Sagar. I'm a senior staff software engineer on the Splunk Observability Platform team at Cisco.

Today I want to talk about Apache Iceberg and the role it plays in modern data lakes.

I'm going to start with some fundamentals, to cover the wide variety of audiences we have here, and after the fundamentals we're going to talk about some scaling challenges when you have streaming data ingestion into Apache Iceberg.

Talk Agenda

So here is our agenda for the next 20 minutes.

We're gonna talk about the core problem that Apache Iceberg helps solve. Then we're gonna look into some Iceberg internals that help establish the problem we're trying to solve, and the problems associated with streaming data into these modern data lakes.

And then we're gonna talk about something that I created as a side project, which is Postgres FileIO.

After that, we're gonna close with some thoughts around how I used AI to build all of this.

The Core Problem: Files Without a Table Layer

So here is the core problem. When you have data coming in, writing the data into files is very easy, right? You just write the data to files. Storing those files on an object store like Amazon S3 or Google Cloud Storage is very straightforward. But these files are just objects on the cloud storage.

It's very hard for us to reason about those file contents without some sort of indexing layer on top of them. You cannot atomically change anything in those files. You cannot evolve the schema if you decide to add new fields or remove some fields. So it's a very basic structure, which could be improved if we had something like a table structure on top of those files.

Why Apache Iceberg: Table Semantics on Object Storage

This is where Apache Iceberg comes into the picture. It creates a table structure on top of those file objects and helps users query the data from those files.

So when you create a new file, what you do is you index the file information into Apache Iceberg. That way it tracks metadata about those files in its own metadata layer. And when you want to query the data out later, you go through Apache Iceberg again, which looks at the metadata layer and finds out where the files are, what contents the files have, all sorts of information.

Now when you use Apache Iceberg, there are some other things that come for free, like time travel. With time travel, you can look at the data as it was at some point in the past, say back in January or December.

And you can also evolve partitions. For example, if you decide to change the partitioning logic from a monthly to a daily basis, you can simply change that, and Iceberg takes care of handling all of it behind the scenes when you come back and query the data.

It also allows atomic commits. So let's say you created a bunch of files, and now you want to rewrite those files so that you have better data co-locality.

You created a bunch of new files and you want to replace the old files with this new set of files. Iceberg allows you to atomically swap those sets of files.

A user querying the data will either see the old set of files or the new set of files, never a mix of the two. So this is also available because we are using Apache Iceberg.
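To make the swap semantics concrete, here is a minimal Java sketch of the idea; the class and method names are my own for illustration, not the actual Iceberg API. The catalog keeps one pointer to the current table state, and a commit is a compare-and-set on that pointer.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of Iceberg's commit: the catalog holds ONE pointer
// to the current snapshot, and a commit swaps it atomically.
class TablePointer {
    record Snapshot(List<String> files) {}

    private final AtomicReference<Snapshot> current;

    TablePointer(Snapshot initial) { current = new AtomicReference<>(initial); }

    // A reader always sees one complete snapshot: old or new, never a mix.
    List<String> readFiles() { return current.get().files(); }

    // The commit succeeds only if nobody else committed in between.
    boolean commit(Snapshot expected, Snapshot replacement) {
        return current.compareAndSet(expected, replacement);
    }
}
```

A second writer whose expected snapshot has gone stale fails the compare-and-set and must retry on top of the new state; this optimistic-concurrency pattern is why readers never observe a half-applied commit.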

Iceberg Internals: The Metadata Stack

Now how does Apache Iceberg do all of this? It starts with its own metadata layer.

Catalog and Table Location

So, the first entry point into an Apache Iceberg table is the Iceberg catalog. If you want to access table information, you need this catalog. What does this catalog store?

It stores some important information, like where the table location is, what the table properties are, and so on.

The metadata.json File and Table State

As you can see in the picture, the Iceberg catalog has a pointer field, and that pointer field points to a file called the metadata file. This metadata file is the metadata.json file. It holds the table state.

It holds information about the current snapshot ID, the schema used to write the data, and the current partition specification. It also holds some historical information, like all the schemas that have been used in the past and all the partition specifications that have been used in the past. So it contains the full table state. Then it has nested JSON fields called snapshots.

Snapshots to Manifest Lists

Snapshots hold a reference to another layer called the manifest list. Snapshots also contain fields like the schema ID, that is, which schema was used to generate the snapshot, that sort of information.

Then we jump to the manifest list, which contains a list of manifest files for the snapshot. It also has some partition stats that allow you to prune a lot of files. We will see an example in the next slide.

Manifest Files and Data File Statistics

Now, if you jump to what it points to, which are the manifest files, I think this is easier to understand: a manifest file contains information about the actual data files that you wrote.

Let's say you created a file; Iceberg tracks the information about that file in this manifest file. For example, it contains the file path, that is, where you actually stored the data that you just wrote.

It contains the file format used to write the data, the partition this file belongs to, and how many records the file has. It also contains some additional information that helps in query planning, like whether a column contains null values, or the min and max values for each column, so that if you are looking for a specific column value, you can look at these min/max values to see whether there is a potential match in that file or not.

Then there are the data files, which are the original files that you intended to write and store in the object storage location. The formats can be Parquet, which is columnar, or Avro as well, but Parquet works very well when you want to do aggregate queries or analytical queries on top of these data lakes.

Query Planning Example: Pruning Files with Metadata

So here is an example of how Apache Iceberg helps run a query. Here is a simple SQL statement where we are looking for a particular date and region, and all the events that match those filters.

Now, Apache Iceberg is tracking the metadata for all the files you wrote. Using that metadata, it is able to say that, out of five files, only three may contain the data.

Now if you didn't have the Apache Iceberg layer, you would have to search all five files. But using Apache Iceberg, you are able to filter out two of the five files, and that's 40% savings right there.
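The pruning step above can be sketched in a few lines of Java; the stats shape and names here are simplified stand-ins for what manifests actually store:

```java
import java.util.List;

// Toy version of manifest-based file pruning: each data file carries
// min/max statistics for a column (here, an event date), and any file
// whose range cannot contain the queried value is skipped entirely.
class Pruner {
    record FileStats(String path, String minDate, String maxDate) {}

    // Keep only files whose [minDate, maxDate] range may contain `date`.
    static List<String> candidateFiles(List<FileStats> files, String date) {
        return files.stream()
                .filter(f -> f.minDate().compareTo(date) <= 0
                          && f.maxDate().compareTo(date) >= 0)
                .map(FileStats::path)
                .toList();
    }
}
```

With five files of which only three have a date range covering the filter, exactly those three come back as candidates; the other two are never opened.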

Write Path: Adding Files and Atomic Commits

Let's talk about what happens when you want to add a new file.

So, let us say you created your file in Parquet format. Then you create a manifest file using the Apache Iceberg library, which is able to extract the min/max values for each column, how many null values there are in each column, and how many records exist in the file. All that information is tracked in this manifest and the other metadata files.

Then there is a snapshot holding this manifest file, and then there is a final atomic swap of that metadata.json file that we saw earlier. So this is what happens when you add a new file.

A new commit happens, which creates this bunch of metadata files along with the original data file that you wrote.

FileIO: Pluggable Storage for Reads and Writes

Now there is something called FileIO, which is exposed by the Apache Iceberg library. It's an interface.

So when systems like a query engine or an ingestion application want to store data into Apache Iceberg, Iceberg delegates all the read and write operations to this FileIO interface.

So this interface can be implemented so that the data can actually be stored on AWS S3, or on Google Cloud Storage in Google Cloud Platform, and so on. It hides the underlying storage layer from the Apache Iceberg code.
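The shape of such an interface can be sketched like this; the method names below are simplified illustrations, not the real `org.apache.iceberg.io.FileIO` signatures:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the FileIO idea: callers read, write and
// delete by path, and the implementation decides where the bytes live.
interface SimpleFileIO {
    byte[] readFile(String path);
    void writeFile(String path, byte[] contents);
    void deleteFile(String path);
}

// In-memory implementation; an S3- or GCS-backed one would implement
// the same three methods, and no calling code would need to change.
class InMemoryFileIO implements SimpleFileIO {
    private final Map<String, byte[]> store = new HashMap<>();
    public byte[] readFile(String path) { return store.get(path); }
    public void writeFile(String path, byte[] contents) { store.put(path, contents); }
    public void deleteFile(String path) { store.remove(path); }
}
```

The pluggability is the whole point: swapping the implementation changes where bytes live without touching the engine or ingestion code above it.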

Streaming Ingestion Challenges with Iceberg

Now let's talk about the streaming aspects and the challenges that come with them.

So let's say you have a high-frequency commit use case, where you want to index the data that's arriving as fast as you can and also serve queries on the newest data available. That means you have to commit the data frequently.

Frequent commits mean frequent new metadata files. A single commit is going to create at least three files: a metadata.json file, a manifest list file, and a manifest file.
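As a back-of-the-envelope illustration (the ten-second commit interval is my own assumption, not a number from the talk), the metadata object count adds up quickly:

```java
// Rough count of metadata objects created by frequent commits, assuming
// each commit writes the three metadata files described above
// (metadata.json, manifest list, manifest).
class CommitMath {
    static long metadataObjectsPerDay(long commitIntervalSeconds, long filesPerCommit) {
        long commitsPerDay = 24L * 60 * 60 / commitIntervalSeconds;
        return commitsPerDay * filesPerCommit;
    }
}
```

A commit every 10 seconds yields 8,640 commits and roughly 25,920 new metadata objects per day, before counting the data files themselves.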

And on the query side, on the read side, all three of these files have to be read before the data is visible to the queries.

Now, the Apache Iceberg library itself has some caching optimizations to help cache these files, but because we are frequently producing new files, the caching layer's benefits are limited: it has to load those new files and cache them.

Why S3 Variability Hurts Predictable Freshness

And when you are using an object store like Amazon S3, the latencies incurred when storing these files and reading them back are variable.

Some S3 calls can be very quick, taking double-digit milliseconds. Some S3 calls can be very slow, taking a couple of seconds.

Now, if you have such highly variable latency, your streaming system is going to have variable latency as well. You cannot guarantee that all the data that has been ingested is going to be visible within X seconds. Because of this variable latency, it's very hard to guarantee that.

So why is this the case? Because S3 is the storage for the metadata layer. S3 is great for storing large amounts of data for a long time, and for high availability. Once you store a data file in S3, it is available without much risk of losing it to disasters or anything like that. S3 has a very high availability SLA, and that comes at the cost of variable IO latency.
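One way to see why this hurts: the three metadata files are read one after another on the query path, so per-call tail latency compounds. A small sketch, assuming the calls are independent (a simplification):

```java
// Probability that ALL n sequential object-store calls finish under a
// given per-call latency quantile, assuming independent calls. Even at
// a 99% per-call quantile, three sequential reads are all "fast" only
// about 97% of the time.
class TailLatency {
    static double allFastProbability(double perCallQuantile, int n) {
        return Math.pow(perCallQuantile, n);
    }
}
```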

Small Object Explosion from Frequent Commits

And because you are frequently committing, you are frequently generating files, and all these files are tiny, so you have a small-object explosion on S3 as well.

A Hybrid Approach: Postgres for Metadata, S3 for Data

So how can we solve the S3 issue for the metadata portion of Iceberg? By using our own implementation of the FileIO interface.

Now, the data files are going to be large, right? You have tons of data coming in, so you're gonna write that data into data files, and they're still gonna be stored on S3,

but the metadata files that Apache Iceberg needs can be stored on a faster medium. They don't have to incur the same latency as the data files.

If they are stored on a faster medium, then the query layer can pull those files quicker, do the query planning quicker, see whether those files even belong to the query or not, and make progress on the query path.

So this FileIO implementation stores the metadata files in Postgres, and it stores the data files in Amazon S3.
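The core routing decision can be sketched by file-name suffix, since Iceberg's metadata files are recognizable that way: metadata.json plus the Avro manifest lists and manifests go to Postgres, and bulk data goes to S3. This is my own simplified illustration, not the library's actual logic:

```java
// Sketch of the hybrid FileIO routing idea: small, latency-sensitive
// metadata files go to Postgres, large data files stay on S3.
class HybridRouter {
    enum Backend { POSTGRES, S3 }

    static Backend route(String path) {
        // metadata.json files, and the Avro manifest lists / manifests,
        // are the files read on every commit and every query plan.
        if (path.endsWith(".metadata.json") || path.endsWith(".avro")) {
            return Backend.POSTGRES;
        }
        return Backend.S3; // Parquet data files and everything else
    }
}
```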

Evolving the Project Through Open Source Feedback

So when I created this library, I initially had the FileIO implementation only using Postgres.

But then, when I released it to open source, a person reached out to me saying that he was also looking into the same problem, and he found that it was missing something: we still need this FileIO implementation to store the data files on S3 in particular use cases.

So he was able to contribute back, and I really liked how fast the turnaround was, and how easy it was to get this out and improve it in iterations through open source.

Benefits and Design Considerations

Now let's talk about the benefits and considerations of using this custom FileIO implementation. If you need predictable latency in your streaming platform, it's better to have a faster medium for this metadata layer.

Some streaming systems already use the JDBC catalog implementation for the original catalog that we talked about before. When you are using the JDBC catalog on top of a SQL engine, it becomes an easier choice to store the additional metadata files on the same SQL engine.

I also made this library extensible, so that if someone is not using Postgres but something else, they can quickly plug that system in to store the data and get the benefits.

When Not to Use This Custom FileIO

Now, when do we not want to use this custom FileIO implementation? Say you start with S3.

Now, if you want to add Postgres just for this, then that's an additional dependency that you are bringing into your architecture, which may or may not be suitable.

Scaling Limits: Large Metadata Blobs and Huge Tables

And then there are large tables: very big deployments may have hundreds of thousands of partitions, and they may also maintain long data retention and the snapshots associated with it, which can make these metadata files become large very quickly. That's when storing 100-megabyte-plus metadata files in Postgres is not suitable, because you're storing each blob as a single value in a column in Postgres. So if you have large deployments, it's better to store those metadata files on S3.

And you also need to look at the complexity, right? If your use case doesn't demand streaming ingestion, or if you don't have the need for frequent commits, then it doesn't make sense to move away from something standard like S3FileIO. So a Postgres FileIO, or a custom FileIO implementation, is useful for small to medium sized workloads with frequent commits, where you need predictable latency.

Closing Thoughts: Building the Solution with AI Assistance

And here are my closing thoughts. I initially identified the problem when I saw that S3 had become the bottleneck in figuring out what new files had been created and whether those new files contained the data I needed or not.

From Problem Identification to Implementation

Once I identified the problem, I was able to figure out how to fix it and how to make it extensible. And once I did those two things, I was able to delegate the work to an AI agent. It wrote the Java code, it looked at the Iceberg API,

it figured out the semantics, it figured out how to talk to PostgreSQL. All of that was done by the AI agent. It wrote unit tests, integration tests, all of that. And once it finished the coding, I was able to verify the correctness of the logic and iterate with the AI agent to fix any bugs or issues that I saw.

Using AI for Slides and Shipping Faster

I even created these slides using AI, which has been super helpful. I really liked how AI was able to help me ship this library to open source and even put together today's presentation.

I'll be hanging around after all the sessions we have today, so if you have any questions, feel free to reach out and I'll answer them.

Audience Q&A: Tools Used for the Presentation

Audience question: What tool did you use to make this presentation?

So I started with a couple of things. I used ChatGPT. It did not do a great job.

Experimenting with ChatGPT, Claude Code, and GenSpark.ai

I used Claude Code. It did not do a great job either. Then I just looked up online how to create a PPT using AI, and I found something called GenSpark.ai.

I gave it a script based on what I wanted to talk about. I again used AI to create the script: I just gave it the highlights, like, this is what I want to focus on, and ChatGPT was able to generate the script.

I copy-pasted that script into this GenSpark tool, which was able to create the slides. There were some export issues, but I was able to fix those manually, and I had to do a lot of work after the GenSpark output to adjust the things I wanted and didn't want. But at the end of the day, I wouldn't have been able to create all these nice slides

without AI. Thank you all.
