
My confession
While digging deeper into Apache Iceberg v3 deletion vectors, I’ve realized that one of my initial assumptions was wrong. Furthermore, letting the great Danica Fine record me pontificating on how it all works is forcing me to eat a big piece of humble pie! I was close, but surely not close enough. Here’s my previous thinking…
I sure SOUNDED confident, but I did mess up one very important concept. I ass-u-me-d that a deletion vector could span more than one data file. Unfortunately, that’s not the case!
TL;DR
For a given snapshot, there can be 0 or 1 deletion vector file per data file AND an Iceberg deletion vector file cannot span more than one data file. That said, having writers logically merge, & physically write, an updated deletion vector file for a new snapshot still goes a long way toward making this a better solution than the v2 positional delete files.
Should you continue reading?
Okay, if that explains it all to you — you are free to go (thanks for your attention to this matter), but if it didn’t make sense let’s explore some more. AND IF the whole concept of Apache Iceberg v2 “positional delete files” is something new, please check out Alex Merced’s Understanding Apache Iceberg Delete Files first.
Heck, if you need an Iceberg overview in general, the good folks at Starburst Academy put a nice & short video together, Understanding Apache Iceberg architecture, to fill in the blanks.
For the Missourians (the show-me state)
Yep, time to get into it and SEE what’s going on for those, like me, who want to see an example or three. I’m going to use Starburst Galaxy as it has the new Iceberg v3 implementation available to anyone — check out the free trial if you haven’t used it before.
Test data
We can use the tpch.tiny.customer table from the Trino TPC-H connector as a source for our testing data.

This has 1500 records whose custkey values so nicely range from 1 – 1500. As this is a very small example (and we want to create multiple data files), we can insert them into our test tables in 3 bands of 500 records each.
Additionally, when we delete records, we can delete evenly distributed batches of records based on how nicely the customer key value divides by 5 and 2. You’ll see!
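If you want to double-check that arithmetic before we start, a quick roll-up of the source table shows where those batch sizes come from: 1500 rows in total, 300 with keys divisible by 5, and 750 with even keys (150 of those are divisible by both, which is why the second delete will only remove 600 more). Here’s one way to see it in Trino:
select count(*) as total_rows,
       count_if(mod(custkey, 5) = 0) as divisible_by_5,
       count_if(mod(custkey, 2) = 0) as even_keys
from tpch.tiny.customer;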
Version 2 positional delete files
Set up a table
Create a test table, populate it with the 1500 rows, and verify 3 data files of 500 records each are present.
-- band 1: custkey 1-500
create table cust_v2
with (type='iceberg', format_version=2)
as select * from tpch.tiny.customer
where custkey <= 500;

-- band 2: custkey 501-1000
insert into cust_v2
select * from tpch.tiny.customer
where custkey between 501 and 1000;

-- band 3: custkey 1001-1500
insert into cust_v2
select * from tpch.tiny.customer
where custkey > 1000;

-- verify 3 data files of 500 records each
select substring(file_path, length(file_path) - 20)
       as end_of_file_name,
       file_format, record_count
from "cust_v2$files";

Delete some records
Delete the 300 records whose customer keys are evenly divisible by 5, then query the $files metadata table again.
delete from cust_v2
where mod(custkey, 5) = 0;

The bottom 3 files are the v2 positional delete files: one for each of the data files, each referencing 100 of the 300 total records that have been deleted.
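If your engine’s $files table exposes the content column (0 for data files, 1 for position delete files), a quick roll-up tells the same story without squinting at file names:
select content,
       count(*) as file_count,
       sum(record_count) as total_records
from "cust_v2$files"
group by content
order by content;
That should come back as 3 data files still holding 1500 rows, plus 3 position delete files covering the 300 deleted rows.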
Delete some more
Now, delete the 600 records whose customer keys are even numbers and list the $files metadata table again.
delete from cust_v2
where mod(custkey, 2) = 0;

The middle 3 files are the next round of v2 positional delete files. Like before, there is one for each of the data files, each referencing 200 of the 600 total records deleted this time.
What happens when we delete more?
Each time we run another DELETE command, a new positional delete file will be created for each affected data file. These delete files will continue to grow and sprawl until we run a compaction operation to clean it all up.
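One way to trigger that compaction in Trino (and Starburst Galaxy) is the optimize table procedure, which rewrites the data files with the accumulated positional deletes applied:
-- rewrite the data files, folding in the accumulated deletes
alter table cust_v2 execute optimize;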
Version 3 deletion vectors
Set up a table
Create a v3 test table, populate it with the same 1500 rows, and verify (just as before) 3 data files of 500 records each are present.
-- same 3 bands of 500 rows as before, but with format_version=3
create table cust_v3
with (type='iceberg', format_version=3)
as select * from tpch.tiny.customer
where custkey <= 500;

insert into cust_v3
select * from tpch.tiny.customer
where custkey between 501 and 1000;

insert into cust_v3
select * from tpch.tiny.customer
where custkey > 1000;

-- verify 3 data files of 500 records each
select substring(file_path, length(file_path) - 20)
       as end_of_file_name,
       file_format, record_count
from "cust_v3$files";

Delete some records
Delete the 300 records whose customer keys are evenly divisible by 5, then query the $files metadata table again.
delete from cust_v3
where mod(custkey, 5) = 0;

As before, the bottom 3 files represent the 300 records deleted. What is different is that these delete files are persisted as Puffin files, not Parquet ones.
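A quick way to confirm that from SQL, assuming the connector surfaces the deletion vectors with a PUFFIN file_format as the v3 spec describes, is to roll up $files by format:
select file_format, count(*) as file_count
from "cust_v3$files"
group by file_format;
That should show 3 PARQUET data files alongside 3 PUFFIN deletion vector entries.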
Delete some more
Now, delete the 600 records with even numbers for their customer keys then list the $files metadata table again.
delete from cust_v3
where mod(custkey, 2) = 0;

Notice that there are still only 3 deletion vector (Puffin) files. That’s because the compute engine merged the previous 100 deleted row positions per file with the new change deleting another 200 rows. Each deletion vector now covers the full 300 deleted records for its associated 500-record data file.
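Since the data is logically the same in both tables, a quick sanity check is that cust_v2 and cust_v3 should each report the same 600 surviving rows (1500 minus the 300 from the first delete and the 600 from the second):
select (select count(*) from cust_v2) as v2_remaining,
       (select count(*) from cust_v3) as v3_remaining;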
What happens when we delete more?
For any new deletions aligned to a data file with an existing deletion vector file, the compute engine will merge the existing contents with the new deletes and write a new Puffin file. This approach keeps down the number of delete files that need to be read when querying the table. It prevents the delete file sprawl that occurs in the v2 implementation, and these files will also be rolled up when a compaction process is run.
Why do we care?
With the sprawl of v2 positional delete files comes more and more file I/O, which decreases performance and increases costs. The v3 deletion vector files ensure that for a given snapshot there will never be more than a single deletion file aligned to each data file. Performance doesn’t keep degrading as more and more deletions occur, and we avoid cost spikes on data lake storage solutions that charge by the number of GET operations.
Did we have fun?
I don’t know about you, but I sure had a blast exploring!
If you want to go digging for even more, we didn’t even compare what is actually stored inside the Parquet-based v2 positional delete files vs the Puffin-based v3 deletion vector files. But hey… let’s get into that another time!
Happy Iceberging!!