iceberg acid transactions with partitions (a behind the scenes perspective)

It has been almost 5 years to the day since I published hive acid transactions with partitions (a behind the scenes perspective), and ever since I started focusing on Apache Iceberg after joining Starburst, I have wanted to port those same scenarios to Iceberg as a set of use cases to explore, once again, what is happening behind the scenes.

If you compare that earlier post with this one you’ll notice that Hive ACID stored all of the transactional details in carefully named/numbered directories and files, as well as in additional fields inside the data files themselves. Iceberg tackles this with a metadata layer that gets persisted alongside the data files, not within them. If all of that is new to you, please watch this quick 5-minute intro video to get a backdrop that will help you in this walk-thru.

For this post, I’ll be leveraging Apache Parquet files instead of Apache ORC and using Starburst Galaxy as my tool of choice while storing the data in an S3 bucket. Consider taking a look at my querying aviation data in the cloud (leveraging starburst galaxy) post, or RTFM, to see what I did to set up this environment. Additionally, you might set up the VS Code Parquet viewer plugin.

Table DDL

Create the table

Let’s create a table with some Data Definition Language to test our use cases out on.

CREATE SCHEMA mycloud.mdatablog;
USE mycloud.mdatablog;

CREATE TABLE try_it (id int, a_val varchar, b_val varchar, 
                     prt varchar)
WITH (type='ICEBERG', format='PARQUET', 
      partitioning = ARRAY['prt']);

DESC try_it;

Raw metadata files

I have a mycloud catalog that was previously set up to store information within a specified folder (mygalaxy in this case) inside my particular S3 bucket. Underneath that, an mdatablog directory was created for the schema of the same name. Additionally, another folder whose name begins with try_it was created for the table itself.
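If you would rather not hunt around in S3 to find that table root, SHOW CREATE TABLE is a quick sanity check; in my experience the Iceberg connector echoes back the resolved location along with the other table properties, though exactly which WITH properties appear can vary by version.

SHOW CREATE TABLE try_it;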

Within that table root, a metadata folder was created that houses the initial metadata content representing the first snapshot created when building this empty table.

The JSON file in the list above is referred to as a “metadata file” in the architectural diagram below. The AVRO file is a “manifest list” file.

I downloaded the JSON “metadata file” and the listing of it below only shows some of the key elements. I added some UPPER-CASE COMMENTS to help explain what you are seeing.

{
  /*
     IDENTIFIES THE ROOT FOLDER LOCATION THAT
     CONTAINS THE DATA AND METADATA FILES WHICH
     MAKE UP THE ICEBERG TABLE
  */
  "location" : "s3://MYBUCKETNAME/mygalaxy/mdatablog/try_it-5425ea84465a4a8ba5c3fa67f3e3d1d4",

  /*
     schemas IS AN ARRAY OF SCHEMA DEFINITIONS
     AND THERE IS ONLY ONE AT THIS TIME -- THAT
     SCHEMA (WITH ID OF '0') IS IDENTIFIED AS 
     THE 'CURRENT SCHEMA'
  */
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "id",
      "required" : false,
      "type" : "int"
    }, {
      "id" : 2,
      "name" : "a_val",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 3,
      "name" : "b_val",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 4,
      "name" : "prt",
      "required" : false,
      "type" : "string"
    } ]
  } ],

  /*
     partition-specs IS AN ARRAY OF KNOWN
     PARTITIONING SPECIFICATIONS AND THERE IS
     ONLY ONE AT THIS TIME -- THAT PARTITION
     SPECIFICATION (WITH ID OF '0') IS 
     IDENTIFIED AS THE 'CURRENT PARTITIONING
     SPECIFICATION'
  */
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ {
      "name" : "prt",
      "transform" : "identity",
      "source-id" : 4,
      "field-id" : 1000
    } ]
  } ],

  /*
     THE FIRST LINE IDENTIFIES THE CURRENT 
     SNAPSHOT (aka VERSION) ID -- THIS IS 
     FOLLOWED BY AN ARRAY OF KNOWN SNAPSHOTS 
     (THERE IS ONLY 1 AT THIS TIME)
  */
  "current-snapshot-id" : 3469773113442948971,
  "snapshots" : [ {
    "sequence-number" : 1,
    "snapshot-id" : 3469773113442948971,
    "timestamp-ms" : 1735237123094,
    "summary" : {
      "total-records" : "0",
      "total-files-size" : "0",
      "total-data-files" : "0",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },

  /*
     THE FOLLOWING IDENTIFIES THE LOCATION OF
     THE 'MANIFEST LIST' FILE AND THAT THE
     DATA IT REFERENCES LEVERAGES THE SCHEMA
     IDENTIFIED EARLIER IN THIS JSON
  */
    "manifest-list" : "s3://MYBUCKETNAME/mygalaxy/mdatablog/try_it-5425ea84465a4a8ba5c3fa67f3e3d1d4/metadata/snap-3469773113442948971-1-b3a30d93-49c3-4768-b6da-438c9eb11eb5.avro",
    "schema-id" : 0
  } ]
}

As you can tell, the metadata file identifies 3 very important things about the table.

  1. The current version of the table; current-snapshot-id
  2. The structure of the table itself; current-schema-id & default-spec-id
  3. The location of the single “manifest list” file that will provide more specifics on the actual files that make up the table’s content; manifest-list

Logical metadata tables

At this point, it would seem logical to crack open that AVRO file above, but let’s try another approach. Iceberg supports a concept called metadata tables that allows you to use SQL to query the details from the various files in the metadata folder.

While I will point you to some more advanced concepts & information at the end of this post, for now we can run the following query and use the first row to determine what the current snapshot ID value is.

SELECT * FROM "try_it$history" 
 ORDER BY made_current_at DESC;

Notice that this snapshot_id column value aligns with the earlier current-snapshot-id value in the metadata file.
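If your Trino/Galaxy version also exposes the $refs metadata table, you can see the same thing from the branch perspective; the main branch should be pointing at that same current snapshot ID.

SELECT * FROM "try_it$refs";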

The DESC try_it; command already gave us the current column names and types. Run the following SQL to get information on the partitions.

SELECT * FROM "try_it$partitions";

It returned nothing because there is no data in the table to report on yet. We will look at this again as we start changing things.

Run the following to see the file name of the linked manifest list.

SELECT snapshot_id, manifest_list
  FROM "try_it$snapshots"
 ORDER BY committed_at DESC 
 LIMIT 1;

Notice the value from the manifest_list column aligns with the manifest-list value in the metadata file.

That manifest list references 0..m “manifest files” and each of those will reference 1..m actual data files. In this post, we won’t explore those AVRO files directly as we can run the following query to determine the physical files that are represented in that graph for the current snapshot.

SELECT * FROM "try_it$files";

Yep, you guessed it — it returns nothing as there is no actual data in the table yet.
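If you do want to peek at the manifest layer without downloading any AVRO files, there is also a $manifests metadata table that lists the manifest files referenced by the current snapshot (it will be empty right now for the same reason).

SELECT * FROM "try_it$manifests";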

DML use cases

Now that we have a table to play with and a bit more understanding of what’s happening behind the scenes, let’s explore some CRUD (Create, Retrieve, Update, Delete) use cases as expressed in Data Manipulation Language.

Txn 1: INSERT single row

INSERT INTO try_it VALUES (1, 'noise', 'bogus', 'p1');
select * from try_it;

A few things should have happened. Let’s start by verifying a new JSON file was created on S3 as shown in the highlighted file below.

After downloading this new metadata file, I determined its current-snapshot-id is 4900337259264042703 and that the last bits of the file name for the manifest-list property are e8b5c8989a75.avro (of course, your values will be different if running these scenarios on your own). I verified that the new manifest list is present on S3, too.
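As a side note, if your Trino/Galaxy version exposes the $metadata_log_entries metadata table, you can list the metadata files and the snapshot each one points to without browsing S3 at all; the first row returned should be that new JSON file.

SELECT file, latest_snapshot_id
  FROM "try_it$metadata_log_entries"
 ORDER BY "timestamp" DESC;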

From here we can now determine the actual data file that was created to store our single record.

SELECT * FROM "try_it$files";

The value of the file_path column ended with /data/prt=p1/20241226_203553_21880_mejih-75b2a829-9205-4d68-b0f2-4ac1783e243c.parquet in my scenario. That lets us know a few things.

  • The file was stored in the /data folder instead of the /metadata folder just under the table’s root directory
  • Its partition column value further created a /prt=p1 folder
  • The file itself is in Parquet format

Taking a peek in the Parquet file itself shows the expected contents when viewed as JSON.

Pretty simple so far…

Txn 2: INSERT multiple rows across multiple partitions

Insert statements allow multiple rows to be added at once, and they all belong to a single ACID transaction identified by a single snapshot ID. This use case exercises that, but to make it a bit more fun we will span more than one partition in a single transaction.

INSERT INTO try_it VALUES 
(2, 'noise', 'bogus', 'p2'),
(3, 'noise', 'bogus', 'p3');
select * from try_it;

Verify that two more files, each in their own new partition folder, are identified in the metadata tables.

SELECT file_path, partition, record_count
  FROM "try_it$files";

When you go directly to S3 you can find these two new folders/files and additionally verify that each of the Parquet files has a single record.

You can see this same information from the perspective of the partitions that make up the current snapshot.

SELECT * FROM "try_it$partitions";

Note: The data column shows all kinds of cool statistical information about each partition. It’s beyond the scope of this post, but this information is key to how Iceberg can perform at scale, as it doesn’t need to read the actual files when trying to determine if the data it is looking for might be in them.
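For a taste of that, the $files metadata table exposes similar per-file statistics; in the Trino versions I have used, the lower_bounds and upper_bounds columns hold the per-column min/max values (your exact column list may vary slightly by version).

SELECT file_path, record_count, lower_bounds, upper_bounds
  FROM "try_it$files";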

Take a closer look at the $snapshots metadata table to get a summary of what happened in the last transaction.

SELECT snapshot_id, summary
  FROM "try_it$snapshots"
 ORDER BY committed_at DESC 
 LIMIT 1;

Here are the contents of the summary column.

For this use case’s transaction, here are the key confirmations of what happened.

  • 2 new records were added for a total of 3 records
  • 2 new partitions were created; each with a new file in them
  • 3 total files are present

Txn 3: UPDATE multiple rows across multiple partitions

The fact that the underlying data lake files are immutable makes updates a bit tricky for table formats like Iceberg. In-place updates of the actual files can’t happen. Basically, Iceberg marks each record being updated as deleted and then does a net-new insert to account for what the updated record should look like at the end of the SQL statement. This section will show what that looks like behind the scenes.

To make the use case more interesting, we’ll make the update span records across multiple partitions so that we can see behavior similar to the prior use case, where a single transaction (snapshot) spans all of the affected partitions.

Let’s start off with the SQL.

UPDATE try_it SET b_val = 'bogus2'
 WHERE a_val = 'noise';
select * from try_it;

See what the summary column shows for the current snapshot.

SELECT snapshot_id, summary
  FROM "try_it$snapshots"
 ORDER BY committed_at DESC 
 LIMIT 1;

For this use case’s transaction, here are the key confirmations of what happened.

  • 3 new “delete files” (these are files that reference records that are being updated) were added with a single record in each; if all 3 records had been in the same partition, there likely would have been a single delete file referencing all 3 records
  • 3 new ‘regular’ data files were created, each with a single row representing the updated record; again, there are so many because each record was in a different partition
  • A total of 9 files make up this snapshot; 6 data files (3 for the original inserts and 3 for the revised updates) and 3 delete files

Verify that the $files metadata table lists all 9 of these files.

SELECT file_path, partition, record_count
  FROM "try_it$files" ORDER BY partition;
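If you want to separate the delete files from the ‘regular’ data files in that listing, the content column of $files identifies each one; per the Iceberg spec, 0 means a data file and 1 means a position delete file.

SELECT file_path, content, record_count
  FROM "try_it$files" ORDER BY partition, content;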

Think about it again…

For each partition, there is a file that holds the original row that was inserted. Then, for the update, there is a delete file that identifies the record to be removed, since it can’t be updated directly. The third file is a complete rewrite of the row with the updated value along with all of the values that did not need to change. Let’s check the files on S3 just for the prt=p3 partition as an example of what happened.

Show me…

Here are the contents of the first Parquet file; ...8e8b3.parquet

Here is the delete file that identifies that the first record in the file above needs to be deleted.

Here is the third data file which shows the full replace with the updated column values.

As a reminder, only the value of b_val was changed.

Txn 4: UPDATE single row (leveraging partitioning)

This use case is just calling out that we should use the partition column in the update statement whenever possible; this makes the work much easier by letting Iceberg look only at the folders that could possibly be affected instead of walking the full table’s contents.

UPDATE try_it SET b_val = 'bogus3' 
 WHERE b_val = 'bogus2' AND prt = 'p2';

The $snapshots.summary column now looks like this.

In a nutshell, this states that a new delete file and a new data file (to tackle the single-row update) were added, and that they only affected a single partition. The total file count went up by 2 (one more delete file for a total of 4 and an additional data file raising that count to 7) to land at a grand total of 11. We can verify that by querying the $files metadata table again.
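That is the same query used before.

SELECT file_path, partition, record_count
  FROM "try_it$files" ORDER BY partition;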

The changes this time are in the prt=p2 partition and you can see these 5 files in S3 as well.

Note: This concept of marking a row as needing to be deleted in one file and then adding the record back as a net-new row in another file is referred to as a Merge On Read strategy, as it is fast for the update operation and defers the responsibility of pulling together the correct row representation to query time.
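As an aside, if those small delete files start piling up, Trino (and therefore Galaxy) supports an optimize table procedure that compacts them back into plain data files; I am not running it here so the file counts above stay intact, but it looks like this.

ALTER TABLE try_it EXECUTE optimize;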

Txn 5 & 6: UPDATE single row to change partition

In the original Hive version of this post, it was not possible to run an update statement that changed the partition column value, but fortunately this is NOT a limitation of Iceberg.

UPDATE try_it SET prt = 'p3'
 WHERE a_val = 'noise' AND prt = 'p1';

Feel free to verify it, but again, I’m only listing it as this post is a port of that prior version I published 5 years ago. For bonus points, see if the $snapshots.summary, as well as the contents of $files, makes sense to you. If it doesn’t, or you just have some questions, please leave a comment and I’ll try to explain anything that doesn’t make sense.
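For reference, here are the same two verification queries used throughout this post.

SELECT snapshot_id, summary
  FROM "try_it$snapshots"
 ORDER BY committed_at DESC 
 LIMIT 1;

SELECT file_path, partition, record_count
  FROM "try_it$files" ORDER BY partition;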

Parting thoughts

Data lake tables have evolved quite a bit from the very early days of the original Hive table format, and we should now expect ACID transactions as a base feature. That said, these two callouts should be stated clearly, at least as of EOY 2024.

  1. Iceberg transactions are single-statement / single-table (many records, of course) in nature — no BEGIN/COMMIT TRANSACTION statements
  2. While ideal for highly concurrent querying, Iceberg transactions are not designed for massive concurrent transactions beyond insert-only operations — not a general purpose replacement for RDBMSs

There is PLENTY more to learn about Apache Iceberg, and there are a variety of sources out there to continue your research with. I encourage you to see if any of my own Iceberg blog posts are of interest, and regarding the material covered in this one, I recommend the following to look at next.

For some great hands-on tutorials for Iceberg, head on over to Starburst Tutorials, as well.

