z-order (visualized)

Like all things in technology, Big Data always has yet another thing for you to learn in your quest to continually stay relevant. The topic for today’s learning is the Z-Order strategy used with modern table formats like Apache Iceberg and Delta Lake. Specifically, I’m going to try to show you what it is (at a high level) and how it compares with the classic sorting that has been used on data lake tables for years.

Before I get started, let me share some other blog posts that I reviewed to help get me up to speed.

Now… my recommendation is to NOT look at these blog posts (well, not yet, at least) if you are just getting started on understanding Z-Order. I would, however, circle back and review them in the order listed if/when you want a deeper dive into the math behind this approach.

Yep… that graphic is in several of the more detailed blog posts, and it is the reason I felt I should try to explain this more simply than all of that. Of course, if it makes perfect sense already… then what are you doing reading this blog post anyway? 😉

Below is a video that shows what a simple table classically sorted by two fields might look like. This is a strategy that has been used for a long while now and I’m happy to see it become a first-class DDL option in Apache Iceberg. Here’s how to leverage it with Trino’s Iceberg connector.

Sorting works great when the WHERE clauses in your SQL filter on the sort columns from left to right. If you want to search on something that does not start with the first column in the sort order, the sort usually isn’t all that much help, and the query often has to scan a lot of data.
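
To make that a bit more concrete, here is a toy sketch in Python. It is not how any real engine is implemented; the 8 x 8 grid of values, the 4-rows-per-file split, and the `files_to_scan` helper are all made up for illustration. It sorts rows by column `a` then `b`, chops them into small "files", and counts how many files a min/max check would still have to open for a filter on each column.

```python
from itertools import product

# Toy table: every (a, b) pair on an 8 x 8 grid, classically sorted by a, then b.
rows = sorted(product(range(8), range(8)))

# Hypothetical file layout: 4 rows per file, which yields 16 files.
FILE_SIZE = 4
files = [rows[i:i + FILE_SIZE] for i in range(0, len(rows), FILE_SIZE)]


def files_to_scan(files, col, value):
    """Count files whose min/max range for column index `col` contains `value`."""
    count = 0
    for f in files:
        vals = [row[col] for row in f]
        if min(vals) <= value <= max(vals):
            count += 1
    return count


# Filter on the leading sort column: only 2 of the 16 files can hold a = 5.
print(files_to_scan(files, 0, 5))  # 2

# Filter on the second sort column: 8 of the 16 files might hold b = 5.
print(files_to_scan(files, 1, 5))  # 8
```

In this toy layout the filter on the leading column skips almost everything, while the filter on the second column still has to open half of the files.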

Applying a Z-Order to the same data with the same sort fields ends up storing the data differently. The video attempts to simplify this for initial understanding. It also shows how this strategy helps when your filters do not follow the left-to-right column ordering.
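
If you like seeing things in code, here is a minimal sketch of the idea in Python, assuming the common bit-interleaving (Morton code) formulation of Z-Order. The `z_value` and `files_to_scan` helpers, the 8 x 8 grid, and the 4-rows-per-file split are my own toy assumptions, not the internals of Iceberg or Delta Lake.

```python
from itertools import product


def z_value(a, b, bits=3):
    """Interleave the bits of a and b into a single Morton (Z-Order) value."""
    z = 0
    for i in range(bits):
        z |= ((a >> i) & 1) << (2 * i + 1)  # bits of a land on odd positions
        z |= ((b >> i) & 1) << (2 * i)      # bits of b land on even positions
    return z


# Toy table: every (a, b) pair on an 8 x 8 grid, ordered by Z-value.
rows = sorted(product(range(8), range(8)), key=lambda r: z_value(*r))

# Hypothetical file layout: 4 rows per file, which yields 16 files.
FILE_SIZE = 4
files = [rows[i:i + FILE_SIZE] for i in range(0, len(rows), FILE_SIZE)]


def files_to_scan(files, col, value):
    """Count files whose min/max range for column index `col` contains `value`."""
    count = 0
    for f in files:
        vals = [row[col] for row in f]
        if min(vals) <= value <= max(vals):
            count += 1
    return count


# Each file now holds a small 2 x 2 block of the grid (the first little "Z").
print(files[0])                    # [(0, 0), (0, 1), (1, 0), (1, 1)]

# A filter on either column touches the same number of files: 4 of 16.
print(files_to_scan(files, 0, 5))  # 4
print(files_to_scan(files, 1, 5))  # 4
```

The point of the toy numbers is that neither column gets the very tight pruning a leading sort column enjoys, but neither is left out in the cold either.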

Time for the movie! It’s only 10 minutes long and you know that voice is smooth!

If that went as well as I hope it did, you’ll probably understand these Z’s within Z’s within Z’s diagrams.

And if you do, then my mission is a success!!

If you need more details, loop back up to the blog posts I presented at the beginning of this post. I’m optimistic that for some, this was enough to talk intelligently during your first couple of chats on the subject. For anyone that wants/needs to do some deeper digging, I’m betting it will also help when reviewing the more detailed material.

I’ll leave you with some thoughts about Z-Order, as we often have a tendency to use anything we hear about, even if it isn’t the right tool for the job. The blog posts I listed provide you even more “considerations” and “warnings”.

  1. If your table isn’t well into the TB range (maybe even at least tens of TBs), then the juice probably won’t be worth the squeeze.
  2. If your predominant queries filter from left to right along the classically sorted columns, then just stick with that straightforward approach.
  3. You will need to come back and rebuild/compact your data files, especially with streaming data pipelines and frequent batch cycles, and Z-Order will be more expensive to produce than classical sorting.
  4. Despite what the Delta Lake folks seem to be saying lately, Z-Order can work great with partitioning on your ludicrously large tables, especially when you are normally querying only a small subset of the existing partitions.

Published by lestermartin

Software development & data engineering trainer/evangelist/consultant currently focused on data lake frameworks including Hive, Spark, Kafka, Flink, NiFi, and Trino/Starburst.
